SPECTRAL NORM OF PRODUCTS OF RANDOM AND 
DETERMINISTIC MATRICES 



ROMAN VERSHYNIN 

Abstract. We study the spectral norm of matrices $W$ that can be factored
as $W = BA$, where $A$ is a random matrix with independent mean zero
entries and $B$ is a fixed matrix. Under the $(4+\varepsilon)$-th moment assumption
on the entries of $A$, we show that the spectral norm of such an $m \times n$ matrix
$W$ is bounded by $\sqrt{m} + \sqrt{n}$, which is sharp. In other words, in regard to
the spectral norm, products of random and deterministic matrices behave
similarly to random matrices with independent entries. This result along
with the previous work of M. Rudelson and the author implies that the
smallest singular value of a random $m \times n$ matrix with i.i.d. mean zero
entries and bounded $(4+\varepsilon)$-th moment is bounded below by $\sqrt{m} - \sqrt{n-1}$
with high probability.



1. Introduction 

This paper grew out of an attempt to understand the class of random matri- 
ces with non-independent entries, but which can be factorized through random 
matrices with independent entries. Equivalently, we are interested in sample 
covariance matrices of a wide class of random vectors - the linear transforma- 
tions of vectors with independent entries. 

Here we study the spectral norm of such matrices. Recall that the spectral
norm $\|W\|$ is defined as the largest singular value of a matrix $W$, which equals
the largest eigenvalue of $\sqrt{WW^*}$. Equivalently, the spectral norm can be
defined as the $\ell_2 \to \ell_2$ operator norm: $\|W\| = \sup_x \|Wx\|_2/\|x\|_2$, where $\|\cdot\|_2$
denotes the Euclidean norm. The spectral norm of random matrices plays a
notable role in particular in geometric functional analysis, computer science,
statistical physics, and signal processing.
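As a small numerical illustration of the two equivalent definitions above, the following sketch estimates the spectral norm of a fixed matrix by power iteration on $W^T W$. It is not part of the paper's argument; the helper names (`matvec`, `spectral_norm`) and the choice of power iteration are assumptions made only for this example, and real matrices are assumed.

```python
import math

def matvec(W, x):
    # Multiply the matrix W (a list of rows) by the vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def euclidean_norm(x):
    return math.sqrt(sum(xi * xi for xi in x))

def spectral_norm(W, iterations=500):
    """Estimate ||W|| = sup_x ||Wx||_2 / ||x||_2 by power iteration on W^T W."""
    Wt = transpose(W)
    n = len(W[0])
    x = [1.0 + 0.01 * k for k in range(n)]  # generic (hypothetical) start vector
    for _ in range(iterations):
        y = matvec(Wt, matvec(W, x))
        norm_y = euclidean_norm(y)
        if norm_y == 0.0:
            return 0.0  # x lies in the kernel of W; the estimate degenerates to 0
        x = [yi / norm_y for yi in y]
    # x is now a unit vector, so ||Wx||_2 is a lower estimate of ||W||.
    return euclidean_norm(matvec(W, x))

# A 3 x 2 matrix with singular values 4 and 3.
norm_diag = spectral_norm([[3.0, 0.0], [0.0, 4.0], [0.0, 0.0]])
# The all-ones 2 x 2 matrix has singular values 2 and 0.
norm_ones = spectral_norm([[1.0, 1.0], [1.0, 1.0]])
```

Power iteration only ever evaluates $\|Wx\|_2$ at unit vectors, so it approaches $\|W\|$ from below; this matches the supremum formulation directly.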

1.1. Matrices with independent entries. For random matrices with inde- 
pendent and identically distributed entries, the spectral norm is well studied. 
Let $W$ be an $m \times n$ matrix whose entries are real independent and identically
distributed random variables with mean zero, variance 1 and finite fourth
moment. Estimates of the type

(1.1) $\|W\| \sim \sqrt{n} + \sqrt{m}$

are known to hold (and are sharp) in both the limit regime for dimensions
increasing to infinity, and the non-limit regime where the dimensions are fixed.
The meaning of (1.1) in the limit regime is that, for a family of matrices as
above whose dimensions $m$ and $n$ increase to infinity and whose aspect ratio
$m/n$ converges to a constant, the ratio $\|W\|/(\sqrt{n} + \sqrt{m})$ converges to 1 almost
surely [33].

In the non-limit regime, i.e. for arbitrary dimensions $n$ and $m$, variants of
(1.1) were proved by Y. Seginer [29] and R. Latala [17]. If $W$ is an $m \times n$
matrix whose entries are i.i.d. mean zero random variables, then denoting the
rows of $W$ by $X_i$ and the columns by $Y_j$, the result of Y. Seginer states
that

$\mathbb{E}\|W\| \le C \big( \mathbb{E} \max_i \|X_i\|_2 + \mathbb{E} \max_j \|Y_j\|_2 \big)$

where $C$ is an absolute constant. This estimate is sharp because $\|W\|$ is
obviously bounded below by the Euclidean norm of any row and any column
of $W$. Furthermore, if the entries $w_{ij}$ of the matrix $W$ are not necessarily
identically distributed, then R. Latala's result [17] states that

$\mathbb{E}\|W\| \le C \Big( \max_i \mathbb{E}\|X_i\|_2 + \max_j \mathbb{E}\|Y_j\|_2 + \big( \sum_{i,j} \mathbb{E} w_{ij}^4 \big)^{1/4} \Big)$.

In particular, if $W$ is an $m \times n$ matrix whose entries are independent random
variables with mean zero and fourth moments bounded by 1, then one can
deduce from either Y. Seginer's or R. Latala's result that

(1.2) $\mathbb{E}\|W\| \le C(\sqrt{n} + \sqrt{m})$.

This is a variant of (1.1) in the non-limit regime.
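The scaling in (1.2) can be observed on a single sample. The sketch below draws one $30 \times 50$ matrix of independent random signs (mean zero, unit variance, fourth moment equal to 1) and compares a power-iteration estimate of its spectral norm with $\sqrt{n} + \sqrt{m}$. The dimensions, the seed and the helper names are arbitrary choices for this illustration, and one sample is of course not evidence for a statement about expectations.

```python
import math
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def spectral_norm(W, iterations=300):
    """Power-iteration estimate of ||W||; a lower estimate of the true norm."""
    Wt = [list(col) for col in zip(*W)]
    x = [1.0 + 0.01 * k for k in range(len(W[0]))]
    for _ in range(iterations):
        y = matvec(Wt, matvec(W, x))
        norm_y = math.sqrt(sum(v * v for v in y))
        if norm_y == 0.0:
            return 0.0
        x = [v / norm_y for v in y]
    return math.sqrt(sum(v * v for v in matvec(W, x)))

def random_sign_matrix(m, n, rng):
    # Entries are independent and take the values -1 and +1 with probability 1/2.
    return [[rng.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(m)]

rng = random.Random(0)  # fixed seed so that the run is reproducible
m, n = 30, 50
W = random_sign_matrix(m, n, rng)
ratio = spectral_norm(W) / (math.sqrt(m) + math.sqrt(n))
```

Every row of a sign matrix has Euclidean norm exactly $\sqrt{n}$, so the true ratio is at least $\sqrt{n}/(\sqrt{n} + \sqrt{m})$ for any sample; the interesting content of (1.2) is the matching upper bound.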

The fourth moment hypothesis is known to be necessary. Consider again a
family of matrices whose dimensions $m$ and $n$ increase to infinity, and whose
aspect ratio $m/n$ converges to a constant. If the entries are independent and
identically distributed random variables with mean zero and infinite fourth
moment, then the upper limit of the ratio $\|W\|/(\sqrt{n} + \sqrt{m})$ is infinite almost
surely [33].

1.2. The main result. The main result of this paper is an extension of the
optimal bound (1.2) to the class of random matrices with non-independent
entries, but which can be factored through a matrix with independent entries.

Theorem 1.1. Let $\varepsilon \in (0, 1)$ and let $m, n, N$ be positive integers. Consider a
random $m \times n$ matrix $W = BA$, where $A$ is an $N \times n$ random matrix whose
entries are independent random variables with mean zero and $(4+\varepsilon)$-th moment
bounded by 1, and $B$ is an $m \times N$ non-random matrix such that $\|B\| \le 1$. Then

(1.3) $\mathbb{E}\|W\| \le C(\varepsilon)(\sqrt{n} + \sqrt{m})$

where $C(\varepsilon)$ is a function that depends only on $\varepsilon$.

Remarks. 1. An important feature of this result is that its conclusion is
independent of the dimension $N$.

2. The proof of Theorem 1.1 yields the stronger estimate

(1.4) $\mathbb{E}\|W\| \le C(\varepsilon)\big( \|B\| \sqrt{n} + \|B\|_{HS} \big)$

valid for an arbitrary (non-random) $m \times N$ matrix $B$. This result is independent
of the dimensions of the matrix $B$, and therefore it holds for an arbitrary linear
operator $B$ acting from the $N$-dimensional Euclidean space $\ell_2^N$ to an arbitrary
Hilbert space.

3. Theorem 1.1 can be interpreted in terms of sample covariance matrices
of random vectors in $\mathbb{R}^m$ of the form $BX$, where $X$ is a random vector in
$\mathbb{R}^N$ with independent entries. Indeed, let $A$ be the random matrix whose
columns are $n$ independent samples of the vector $X$. Then $W = BA$ is the
matrix whose columns are $n$ independent samples of the random vector $BX$.
The sample covariance matrix of the random vector $BX$ is defined as
$\Sigma = \frac{1}{n} WW^*$. Theorem 1.1 states that the largest eigenvalue of $\Sigma$ is bounded by
$C_1(\varepsilon)(1 + m/n)$, which is further bounded by $C_2(\varepsilon)$ for the number of samples
$n \ge m$ (and independently of the dimension $N$). This problem was previously
studied in [I], [5] in the limit regime for $m = N$, where the result must of
course depend on $N$.

4. Under the stronger subgaussian moment assumption (1.6) on the entries,
Theorem 1.1 is easy to prove using standard concentration and an $\varepsilon$-net
argument. In contrast, if only some finite moment is assumed, we do not know any
simple proof.

1.3. The smallest singular value. Our main motivation for Theorem 1.1
was to complete the analysis of the smallest singular value of random
rectangular matrices carried out by M. Rudelson and the author in [28]. The
smallest singular value $s_{\min}(W)$ of a matrix $W$ can be equivalently described
as $s_{\min}(W) = \inf_x \|Wx\|_2/\|x\|_2$.

Analyzing the smallest singular value is generally harder than analyzing the
largest one (the spectral norm). The analogue of (1.1) for the smallest singular
value of random $m \times n$ matrices $W$ (for $m \ge n$) is

(1.5) $s_{\min}(W) \sim \sqrt{m} - \sqrt{n}$.

The optimal limit version of this result proved in [7] holds under exactly the
same hypotheses as (1.1) - for i.i.d. entries with mean zero, variance 1 and
finite fourth moment.

Many papers addressed (1.5) for fixed dimensions $n$, $m$. Sufficiently tall
matrices ($m \ge Cn$ for sufficiently large $C$) were studied in [8]; extensions to
genuinely rectangular matrices ($m \ge (1 + \varepsilon)n$ for some $\varepsilon > 0$) were studied in
[20], [2], [23], with gradually improving dependence on $\varepsilon$. An optimal version of
(1.5) for all dimensions was obtained in [28]. All these works put somewhat
stronger moment assumptions than the fourth moment on the entries of the
matrix $W$. A convenient assumption is that the entries $w_{ij}$ are subgaussian
random variables. This means that all their moments are bounded by the
corresponding moments of the standard normal random variable, i.e.

(1.6) $(\mathbb{E}|w_{ij}|^p)^{1/p} \le M\sqrt{p}$ for all $p \ge 1$

where $M$ is called the subgaussian moment. It was proved in [28] that if
the entries of $W$ are i.i.d. mean zero subgaussian random variables with unit
variance, then for every $t > 0$ one has

(1.7) $\mathbb{P}\big( s_{\min}(W) \le t(\sqrt{m} - \sqrt{n-1}) \big) \le (Ct)^{m-n+1} + e^{-cm}$

where $C, c > 0$ depend only on the subgaussian moment $M$. In particular, for
such matrices we have

(1.8) $s_{\min}(W) \ge c_1(\sqrt{m} - \sqrt{n-1})$ with high probability

where $c_1 > 0$ depends only on the desired probability and the subgaussian
moment. This result encompasses the case of square matrices where $m = n$
and hence (1.8) yields $s_{\min}(W) \ge c_1/\sqrt{n}$. For Gaussian square matrices this
optimal bound was obtained in [11] and [30]; for general square matrices a
weaker bound $n^{-3/2}$ was obtained in [25] and the best bound as above in [26];
the estimate is shown to be optimal in [27].

Whether (1.8) holds under weaker moment assumptions was only known in
the case of square matrices. It was proved in [26] using (1.2) that (1.8) holds
under the fourth moment assumption for square matrices, i.e. for $m = n$.
Whether the same is true for arbitrary rectangular matrices under the fourth
moment assumption was left open in [28]. The bottleneck of the argument
occurred in Proposition 7.3 of [28], where we needed a correct bound on the
spectral norm of a product of a random matrix and a fixed orthogonal
projection. Such a bound was easy to get only under the subgaussian hypothesis.
Theorem 1.1 of the present paper extends the argument of [28] to random
matrices with bounded $(4+\varepsilon)$-th moment. The following corollary follows
directly from the argument of [28] and Theorem 1.1.

Corollary 1.2 (Smallest singular value). Let $\varepsilon \in (0, 1)$ and $m \ge n$ be positive
integers. Let $A$ be a random $m \times n$ matrix whose entries are i.i.d. random
variables with mean zero, unit variance and $(4+\varepsilon)$-th moment bounded by $M$.
Then, for every $\delta > 0$ there exist $t > 0$ and $n_0$ which depend only on $\varepsilon$, $\delta$ and
$M$, and such that

$\mathbb{P}\big( s_{\min}(A) \le t(\sqrt{m} - \sqrt{n-1}) \big) \le \delta$ for all $n \ge n_0$.

This result follows by the argument in [28], where one considers probability
estimates conditional on the event that the norm of a product $W$ of a random
matrix and a non-random orthogonal projection is small (see [28, Proposition 7.3]).

After this paper was written, two important related results appeared on the
universality of the smallest singular value in two extreme regimes - for almost
square matrices and for genuinely rectangular matrices. One of these results,
by T. Tao and V. Vu [32], works for square and almost square matrices where
the defect $m - n$ is constant. It is valid for matrices with i.i.d. entries with
mean zero, unit variance and bounded $C$-th moment where $C$ is a sufficiently
large absolute constant. The result states that the smallest singular value of
such $m \times n$ matrices $A$ is asymptotically the same as that of the Gaussian matrix $G$
of the same dimensions and with i.i.d. standard normal entries. Specifically,

(1.9) $\mathbb{P}\big( m\, s_{\min}(G)^2 \le t - m^{-c} \big) - m^{-c} \le \mathbb{P}\big( m\, s_{\min}(A)^2 \le t \big) \le \mathbb{P}\big( m\, s_{\min}(G)^2 \le t + m^{-c} \big) + m^{-c}$.

This universality result, combined with the known asymptotic estimates of
the smallest singular value of Gaussian matrices $s_{\min}(G)$, allows one to obtain
bounds sharper than in Corollary 1.2. However, the universality result of [32]
is only known in the almost square regime $m - n = O(1)$ (and under stronger
moment assumptions), while Corollary 1.2 is valid for all dimensions $m \ge n$.

Another recent universality result was obtained by O. Feldheim and S. Sodin
[12] for genuinely rectangular matrices, i.e. with aspect ratio $m/n$ separated
from 1 by a constant, and with subgaussian i.i.d. entries. In particular they
proved the inequality

(1.10) $\mathbb{P}\big( s_{\min}(A)^2 \le (\sqrt{m} - \sqrt{n})^2 - tm \big) \le \frac{C}{1 - \sqrt{n/m}} \exp(-c n t^{3/2})$.

Deviation inequalities (1.7) and (1.10) complement each other - the former
is multiplicative (and is valid for arbitrary dimensions) while the latter is
additive (and is applicable for genuinely rectangular matrices). Each of these
two inequalities clearly has the regime where it is stronger.

1.4. Outline of the argument. Let us sketch the proof of Theorem 1.1. We
can assume that $m = n$ by adding an appropriate number of zero columns
to $A$ or rows to $B$. Since the columns of $A$ are independent, the columns
$X_1, \ldots, X_n$ of the matrix $W$ are independent random vectors in $\mathbb{R}^n$. We would
like to bound the spectral norm of $WW^* = \sum_j X_j \otimes X_j$, which is a sum of
independent random operators. For random vectors $X_j$ uniformly distributed
in convex bodies, deviation inequalities for sums $\sum_j X_j \otimes X_j$ were studied in
[15], [10], [22], [14], [21], [3], [1]. For general distributions, a sharp estimate for such
sums has been proved by M. Rudelson [22]. This approach, which we develop
in Section 3, leads us to the bound

(1.11) $\mathbb{E}\|W\| \le C\sqrt{n \log n}$.

This bound is already independent of the dimension $N$, but is off by $\sqrt{\log n}$
from being optimal. The logarithmic term is unfortunately a limitation of
this method. This term comes from M. Rudelson's result, Theorem 3.1 below,
where it is needed in full generality. It would be useful to understand the
situations where the logarithmic term can be removed from M. Rudelson's
theorem. So far, only one such situation is known from [1], where the independent
random vectors $X_j$ are uniformly distributed in a convex body.

In the absence of a suitable variant of M. Rudelson's theorem without the
logarithmic term, the rest of our argument will proceed to remove this term from
(1.11) using the rich independence structure, which is inherited by the vectors
$X_j$ from the random matrix $A$. (However, the independence structure is
encoded nontrivially via the linear transformation $B$, which makes the entries of
$X_j$ dependent.) A more delicate application of M. Rudelson's theorem allows
one to transfer the logarithmic term from the conclusion to the assumption.
Namely, Theorem 3.9 establishes the optimal bound $\mathbb{E}\|W\| \le C\sqrt{n}$ in the case
when all columns of $B$ are logarithmically small, i.e. their Euclidean norm is
at most $\log^{-O(1)} n$. While some columns of a general matrix $B$ may be large,
the boundedness of $B$ implies that most columns are always logarithmically
small - all but $n \log^{O(1)} n$ of them. So, we can remove from $B$ the
already controlled small columns, which will make $B$ an almost square matrix.
In other words, we can assume hereafter that $N = n \log^{O(1)} n$.

The advantage of almost square matrices is that the magnitude of their
entries is easy to control. The $(4+\varepsilon)$-th moment hypothesis and Markov's
inequality imply that the entries of $A = (a_{ij})$ satisfy
$\max_{i,j} |a_{ij}| \le \sqrt{n}$ with high probability. Note that the same estimate holds for
square matrices ($N = n$) under the fourth moment assumption. So, in regard
to the magnitude of entries, almost square matrices are similar to exactly
square matrices, for which the desired bound follows from R. Latala's result [17].
This prompts us to construct the proof of Theorem 1.1 for almost square
matrices similarly to R. Latala's argument in [17], i.e. using fairly standard
concentration of measure results in the Gauss space, coupled with delicate
constructions of nets. We first decompose $A$ into a sum of matrices which
contain entries of similar magnitude. As the magnitude increases, these matrices
become sparser. This quickly reduces the problem to random sparse matrices,
whose entries are i.i.d. random variables valued in $\{-1, 0, 1\}$. The spectral
norm of random sparse matrices was studied in [16] as a development of the
work of Z. Füredi and J. Komlós [13]. However, we need to bound the
spectral norm of the matrix $W = BA$ rather than $A$. Independence of entries is
not available for $W$, which makes it difficult to use the known combinatorial
methods based on bounding the trace of high powers of $W$.

To summarize, at this point we have an almost square random sparse matrix
$A$, and we need to bound the spectral norm of $W = BA$, which is
$\|W\| = \sup_x \|Wx\|_2$, where the supremum is over all unit vectors $x \in \mathbb{R}^n$. The
well-known method is to first fix $x$ and bound $\|Wx\|_2$ with high probability; then
take a union bound over all $x$ in a sufficiently fine net of the unit sphere of
$\mathbb{R}^n$. However, a probability bound for every fixed vector $x$, which follows from
standard concentration inequalities, is not strong enough to make this method
work. Sparse vectors - those which have few but large nonzero coordinates
- produce worse concentration bounds than spread vectors, which have many
but small nonzero coordinates. What helps us is that there are fewer sparse
vectors on the sphere than there are spread vectors. This leads to a tradeoff
between concentration and entropy, i.e. between the probability with which
$\|Wx\|_2$ is nicely bounded, and the size of a net for the vectors $x$ which achieve
this probability bound. One then divides the unit Euclidean sphere in $\mathbb{R}^n$
into classes of vectors according to their "sparsity", and uses the
entropy-concentration tradeoff for each class separately. This general line is already
present in Latala's argument [17], and it was developed extensively in
recent years, see e.g. [SUl ESI ESI EB]. This argument is presented in Section 4,
where it leads to a useful estimate for norms of sparse matrices, Corollary 4.9.
With this in hand, one can quickly finish the proof of Theorem 1.1.

Acknowledgement. The author is grateful to the referee for a careful
reading of the manuscript, and for many suggestions which greatly improved the
presentation.

2. Preliminaries 

2.1. Notation. Throughout the paper, the results are stated and proved over 
the field of real numbers. They are easy to generalize to complex numbers. 

We denote by $C, C_1, c, c_1, \ldots$ positive absolute constants, and by $C(\varepsilon), C_1(\varepsilon), \ldots$
positive quantities that may depend only on the parameter $\varepsilon$. Their values can
change from line to line.

The standard inner product in $\mathbb{R}^n$ is denoted $\langle x, y \rangle$. For a vector $x \in \mathbb{R}^n$, we
denote the cardinality of its support by $\|x\|_0 = |\{j : x_j \ne 0\}|$, the Euclidean
norm by $\|x\|_2 = (\sum_j x_j^2)^{1/2}$, and the sup-norm by $\|x\|_\infty = \max_j |x_j|$. The
unit Euclidean ball in $\mathbb{R}^n$ is denoted by $B_2^n = \{x : \|x\|_2 \le 1\}$, and the unit
Euclidean sphere in $\mathbb{R}^n$ is denoted by $S^{n-1} = \{x : \|x\|_2 = 1\}$.

The tensor product of vectors $x, y \in \mathbb{R}^n$ is the linear operator $x \otimes y$ on $\mathbb{R}^n$
defined as $(x \otimes y)(z) = \langle x, z \rangle y$ for $z \in \mathbb{R}^n$.
2.2. Concentration of measure. The method that we carry out in Section 4
uses concentration in the Gauss space in combination with constructions of
$\varepsilon$-nets. Here we recall some basic facts we need.

The standard Gaussian random vector $g \in \mathbb{R}^m$ is a random vector whose
coordinates are independent standard normal random variables. The following
concentration inequality can be found e.g. in [TjJJ inequality (1.5)].

Theorem 2.1 (Gaussian concentration). Let $f : \mathbb{R}^m \to \mathbb{R}$ be a Lipschitz
function. Let $g$ be a standard Gaussian random vector in $\mathbb{R}^m$. Then for every
$t > 0$ one has

$\mathbb{P}\big( f(g) - \mathbb{E} f(g) > t \big) \le \exp\big( -c_0 t^2/\|f\|_{\mathrm{Lip}}^2 \big)$

where $c_0 \in (0, 1)$ is an absolute constant.

As a very restrictive but useful example, Theorem 2.1 implies the following
deviation inequality for sums of independent exponential random variables
$g_i^2$ (which can also be derived by the more standard approach via moment
generating functions).

Corollary 2.2 (Sums of exponential random variables). Let $d = (d_1, \ldots, d_m)$
be a vector of real numbers, and let $g_1, \ldots, g_m$ be independent standard normal
random variables. Then, for every $t > 0$ we have

$\mathbb{P}\Big( \big( \sum_{i=1}^m d_i^2 g_i^2 \big)^{1/2} > \|d\|_2 + t \Big) \le \exp\big( -c_0 t^2/\|d\|_\infty^2 \big)$.

Proof. The function $f(y) = (\sum_{i=1}^m d_i^2 y_i^2)^{1/2}$ is a Lipschitz function on $\mathbb{R}^m$ with
$\|f\|_{\mathrm{Lip}} = \|d\|_\infty$. Moreover, Hölder's inequality implies that

$\mathbb{E} f(g) = \mathbb{E}\big( \sum_{i=1}^m d_i^2 g_i^2 \big)^{1/2} \le \big( \mathbb{E} \sum_{i=1}^m d_i^2 g_i^2 \big)^{1/2} = \|d\|_2$.

Theorem 2.1 completes the proof. □

Another classical deviation inequality we will need is Bennett's inequality,
see e.g. [9, Theorem 2]:

Theorem 2.3 (Bennett's inequality). Let $X_1, \ldots, X_N$ be independent mean
zero random variables such that $|X_i| \le 1$ for all $i$. Consider the sum
$S = X_1 + \cdots + X_N$ and let $\sigma^2 := \mathrm{Var}(S)$. Then, for every $t > 0$ we have

$\mathbb{P}(S > t) \le \exp\big( -\sigma^2 h(t/\sigma^2) \big)$

where $h(u) = (1 + u) \log(1 + u) - u$.
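To get a feeling for the size of the bound in Bennett's inequality, the sketch below evaluates the right hand side $\exp(-\sigma^2 h(t/\sigma^2))$ and compares it with the more familiar Bernstein-type bound $\exp(-t^2/(2\sigma^2 + 2t/3))$. The comparison rests on the standard elementary inequality $h(u) \ge u^2/(2 + 2u/3)$ for $u \ge 0$, which is not stated in the text; the function names and the sample values are chosen only for this illustration.

```python
import math

def bennett_h(u):
    """h(u) = (1 + u) log(1 + u) - u, evaluated with log1p for accuracy near 0."""
    return (1.0 + u) * math.log1p(u) - u

def bennett_tail_bound(t, variance):
    """Right hand side exp(-sigma^2 h(t / sigma^2)), with variance = sigma^2 > 0."""
    return math.exp(-variance * bennett_h(t / variance))

def bernstein_tail_bound(t, variance):
    """Bernstein-type bound exp(-t^2 / (2 sigma^2 + 2t/3)) for |X_i| <= 1."""
    return math.exp(-t * t / (2.0 * variance + 2.0 * t / 3.0))

variance = 10.0
comparison = [(t, bennett_tail_bound(t, variance), bernstein_tail_bound(t, variance))
              for t in (0.5, 5.0, 50.0)]
```

For small $t/\sigma^2$ the two bounds nearly coincide (the subgaussian regime), while for large $t/\sigma^2$ the extra logarithm in $h$ makes Bennett's bound noticeably smaller.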

We will also need M. Talagrand's concentration inequality for convex
Lipschitz functions from [31, Theorem 6.6]; see also [18, Corollary 4.10] and the
discussion below it.

Theorem 2.4 (Concentration of Lipschitz convex functions). Let $X_1, \ldots, X_m$
be independent random variables such that $|X_i| \le K$ for all $i$. Let $f : \mathbb{R}^m \to \mathbb{R}$
be a convex and 1-Lipschitz function. Then for every $t > 0$ one has

$\mathbb{P}\big( |f(X_1, \ldots, X_m) - \mathbb{E} f(X_1, \ldots, X_m)| \ge Kt \big) \le 4 \exp(-t^2/4)$.

2.3. Nets. Consider a subset $U$ of a normed space $X$, and let $\varepsilon > 0$. Recall
that an $\varepsilon$-net of $U$ is a subset $\mathcal{N}$ of $U$ such that the distance from any point
of $U$ to $\mathcal{N}$ is at most $\varepsilon$. In other words, for every $x \in U$ there exists $y \in \mathcal{N}$
such that $\|x - y\|_X \le \varepsilon$.

The following estimate follows by a volumetric argument, see e.g. the proof
of Lemma 9.5 in [19].

Lemma 2.5 (Cardinality of $\varepsilon$-nets). Let $\varepsilon \in (0, 1)$. The unit Euclidean ball
$B_2^n$ and the unit Euclidean sphere $S^{n-1}$ in $\mathbb{R}^n$ both have $\varepsilon$-nets of cardinality
at most $(1 + 2/\varepsilon)^n$.
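A concrete instance may help: in the plane, equally spaced points on the unit circle form an $\varepsilon$-net whose size is far below $(1 + 2/\varepsilon)^2$. The sketch below builds such a net and uses it to bracket the spectral norm of a $2 \times 2$ matrix between the supremum over the net and that supremum divided by $1 - \varepsilon$, the usual correction factor when a norm is computed on a net. The construction of the net and all names are assumptions made for this example only.

```python
import math

def circle_net(eps):
    """Equally spaced points forming an eps-net of the unit circle S^1."""
    # With angular spacing delta, every point of the circle lies within chord
    # distance 2 sin(delta / 4) of the net, so we need delta <= 4 asin(eps / 2).
    count = max(3, math.ceil(math.pi / (2.0 * math.asin(eps / 2.0))))
    return [(math.cos(2.0 * math.pi * k / count), math.sin(2.0 * math.pi * k / count))
            for k in range(count)]

def apply(A, x):
    return tuple(sum(a * xi for a, xi in zip(row, x)) for row in A)

def norm2(x):
    return math.sqrt(sum(xi * xi for xi in x))

A = [[2.0, 1.0], [0.0, 1.0]]
eps = 0.25
net = circle_net(eps)
sup_on_net = max(norm2(apply(A, x)) for x in net)
upper = sup_on_net / (1.0 - eps)
# A^T A has trace 6 and determinant 4, so its eigenvalues are 3 +/- sqrt(5).
true_norm = math.sqrt(3.0 + math.sqrt(5.0))
```

In this example the net has 13 points, well within the volumetric bound $(1 + 2/0.25)^2 = 81$.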

When computing norms of linear operators, $\varepsilon$-nets provide a convenient
discretization of the problem. We formalize it in the next proposition.

Proposition 2.6 (Computing norms on nets). Let $A : X \to Y$ be a linear
operator between normed spaces $X$ and $Y$, and let $\mathcal{N}$ be an $\varepsilon$-net of either the
unit sphere $S(X)$ or the unit ball $B(X)$ of $X$ for some $\varepsilon \in (0, 1)$. Then

$\|A\| \le \frac{1}{1 - \varepsilon} \sup_{x \in \mathcal{N}} \|Ax\|_Y$.

Proof. We give the proof for an $\varepsilon$-net of the unit sphere; the case of the unit
ball is similar. Every $z \in S(X)$ has the form $z = x + h$, where $x \in \mathcal{N}$ and
$\|h\|_X \le \varepsilon$. Since $\|A\| = \sup_{z \in S(X)} \|Az\|_Y$, the triangle inequality yields

$\|A\| \le \sup_{x \in \mathcal{N}} \|Ax\|_Y + \sup_{\|h\|_X \le \varepsilon} \|Ah\|_Y$.

The last term in the right hand side is bounded by $\varepsilon \|A\|$. Thus we have shown
that

$(1 - \varepsilon)\|A\| \le \sup_{x \in \mathcal{N}} \|Ax\|_Y$.

This completes the proof. □



2.4. Symmetrization. We will use the standard symmetrization technique
as was done in [17]; see more general inequalities in e.g. [19, Section 6.1]. To
this end, let the matrices $A = (a_{ij})$ and $B$ be as in Theorem 1.1. Let $A' = (a'_{ij})$
be an independent copy of $A$, and let $\varepsilon_{ij}$ be independent symmetric Bernoulli
random variables. Then, by Jensen's inequality,

$\mathbb{E}\|BA\| = \mathbb{E}\|B(A - \mathbb{E}A')\| \le \mathbb{E}\|B(A - A')\| = \mathbb{E}\|B(\varepsilon_{ij}(a_{ij} - a'_{ij}))\| \le 2\,\mathbb{E}\|B(\varepsilon_{ij} a_{ij})\|$.

Therefore, we can assume without loss of generality in Theorem 1.1 that $a_{ij}$
are symmetric random variables. Furthermore, let $g_{ij}$ be independent standard
normal random variables. Then, again by Jensen's inequality,

$\mathbb{E}\|B(g_{ij} a_{ij})\| = \mathbb{E}\|B(\varepsilon_{ij} |g_{ij}| a_{ij})\| \ge \mathbb{E}\|B(\varepsilon_{ij}\, \mathbb{E}(|g_{ij}|)\, a_{ij})\| = (2/\pi)^{1/2}\, \mathbb{E}\|B(\varepsilon_{ij} a_{ij})\|$.

Therefore

(2.1) $\mathbb{E}\|BA\| \le (2\pi)^{1/2}\, \mathbb{E}\|B(g_{ij} a_{ij})\|$.

Conditioning on $a_{ij}$, we thus reduce the problem to random Gaussian matrices.
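The constant in (2.1) comes from the identity $\mathbb{E}|g| = (2/\pi)^{1/2}$ for a standard normal variable $g$, combined with the factor 2 from the first symmetrization step: $2 \cdot (\pi/2)^{1/2} = (2\pi)^{1/2}$. The following sketch confirms the arithmetic by numerical integration; the midpoint rule, the step size and the cutoff are arbitrary choices for this check.

```python
import math

def gaussian_abs_mean(step=1e-3, cutoff=12.0):
    """Midpoint-rule approximation of E|g| for a standard normal variable g."""
    total = 0.0
    steps = int(cutoff / step)
    for k in range(steps):
        t = (k + 0.5) * step
        total += t * math.exp(-0.5 * t * t) * step
    # E|g| = 2 * integral over (0, infinity) of t * phi(t) dt, phi the normal density.
    return 2.0 * total / math.sqrt(2.0 * math.pi)

abs_mean = gaussian_abs_mean()
constant = 2.0 / abs_mean  # the constant appearing in (2.1)
```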

We will use a similar symmetrization technique several times in our
argument. In particular, in the proof of Lemma 3.8 we apply the following
observation, which can be deduced from the standard symmetrization lemma ([19,
Lemma 6.3]) and the contraction principle ([19, Theorem 4.4]). For the reader's
convenience we include a direct proof.

Lemma 2.7 (Symmetrization). Consider independent mean zero random
variables $Z_{ij}$ such that $|Z_{ij}| \le 1$, independent symmetric Bernoulli random
variables $\varepsilon_{ij}$, and vectors $x_{ij}$ in some Banach space, where both $i$ and $j$ range in
some finite index sets. Then

$\mathbb{E} \max_j \big\| \sum_i Z_{ij} x_{ij} \big\| \le 2\, \mathbb{E} \max_j \big\| \sum_i \varepsilon_{ij} x_{ij} \big\|$.

Proof. To be specific, we can assume that both indices $i$ and $j$ range in the
interval $\{1, \ldots, n\}$ for some integer $n$. Let $(Z'_{ij})$ denote an independent copy of
the sequence of random variables $(Z_{ij})$. Then $Z_{ij} - Z'_{ij}$ are symmetric random
variables. We have

$\mathbb{E} \max_j \big\| \sum_i Z_{ij} x_{ij} \big\| = \mathbb{E} \max_j \big\| \sum_i (Z_{ij} - \mathbb{E} Z'_{ij}) x_{ij} \big\|$ (since $\mathbb{E} Z'_{ij} = 0$)

$\le \mathbb{E} \max_j \big\| \sum_i (Z_{ij} - Z'_{ij}) x_{ij} \big\|$ (by Jensen's inequality)

$= \mathbb{E} \max_j \big\| \sum_i \varepsilon_{ij} (Z_{ij} - Z'_{ij}) x_{ij} \big\|$ (by symmetry)

$\le 2 \max_{|a_{ij}| \le 1} \mathbb{E} \max_j \big\| \sum_i \varepsilon_{ij} a_{ij} x_{ij} \big\|$

where the last line follows because $|Z_{ij} - Z'_{ij}| \le |Z_{ij}| + |Z'_{ij}| \le 2$. The function

$(a_{ij})_{i,j=1}^n \mapsto \mathbb{E} \max_j \big\| \sum_i \varepsilon_{ij} a_{ij} x_{ij} \big\|$

is a convex function. Therefore, on the compact convex set $[-1, 1]^{n^2}$ it attains
its maximum on the extreme points, where all $a_{ij} = \pm 1$. By symmetry, the
function takes the same value at each extreme point, which equals

$\mathbb{E} \max_j \big\| \sum_i \varepsilon_{ij} x_{ij} \big\|$.

This completes the proof. □



2.5. Truncation and conditioning. We will need some elementary obser- 
vations related to truncation and conditioning of random variables. 

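Lemma 2.8 (Truncation). Let $X$ be a non-negative random variable, and let
$M > 0$, $p \ge 1$. Then

$\mathbb{E} X \mathbf{1}_{\{X > M\}} \le \frac{\mathbb{E} X^p}{M^{p-1}}$.

Proof. Indeed,

$\mathbb{E} X \mathbf{1}_{\{X > M\}} \le \mathbb{E} X (X/M)^{p-1} \mathbf{1}_{\{X > M\}} \le \mathbb{E} X^p / M^{p-1}$.

The lemma is proved. □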
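The truncation lemma is easy to test on a finite distribution, where both sides can be computed exactly. The sketch below does so with rational arithmetic; the particular distribution and the helper names are hypothetical, chosen only for this illustration.

```python
from fractions import Fraction

def truncated_mean(dist, M):
    """E X 1_{X > M} for a finite distribution given as {value: probability}."""
    return sum(p * x for x, p in dist.items() if x > M)

def moment(dist, p):
    """E X^p for a finite distribution and an integer exponent p."""
    return sum(prob * x ** p for x, prob in dist.items())

def truncation_bound(dist, M, p):
    """Right hand side E X^p / M^(p-1) of the truncation lemma."""
    return moment(dist, p) / M ** (p - 1)

# A non-negative random variable taking the values 0, 1, 3, 10.
dist = {Fraction(0): Fraction(1, 2), Fraction(1): Fraction(1, 4),
        Fraction(3): Fraction(3, 16), Fraction(10): Fraction(1, 16)}

checks = [(M, p, truncated_mean(dist, Fraction(M)), truncation_bound(dist, Fraction(M), p))
          for M in (1, 2, 5) for p in (2, 3, 4)]
```

For instance, with $M = 2$ and $p = 2$ the left hand side is $19/16$ and the right hand side is $131/32$.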



We will also need two elementary conditioning lemmas. In Section 4 we will
need to control the maximal magnitude of the entries $M_0 = \max_{i,j} |a_{ij}|$ of the
random matrix $A$. Conditioning on $M_0$ will unfortunately destroy the
independence of the entries. So, we will instead condition on an event $\{M_0 \le t\}$ for
fixed $t$, which will clearly preserve the independence. This conditional
argument, used in the proof of Corollary 4.11, relies on the following two elementary
lemmas.

Lemma 2.9. Let $X$ be a random variable and $K$ be a real number. Then

$\mathbb{E}(X \mid X \le K) \le \mathbb{E} X$.

Proof. By the law of total probability,

$\mathbb{E} X = \mathbb{E}(X \mid X \le K)\, \mathbb{P}(X \le K) + \mathbb{E}(X \mid X > K)\, \mathbb{P}(X > K)$.

Thus $\mathbb{E} X$ is a convex combination of the numbers $a = \mathbb{E}(X \mid X \le K)$ and
$b = \mathbb{E}(X \mid X > K)$. Since clearly $a \le K \le b$, we must have $a \le \mathbb{E} X \le b$. □
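Lemma 2.9 says that conditioning on an upper cutoff can only lower the mean. A quick exact check on a finite distribution is sketched below; the distribution and function names are made up for the example, and the conditioning event is assumed to have positive probability.

```python
from fractions import Fraction

def mean(dist):
    """E X for a finite distribution given as {value: probability}."""
    return sum(p * x for x, p in dist.items())

def conditional_mean_below(dist, K):
    """E(X | X <= K); requires P(X <= K) > 0."""
    mass = sum(p for x, p in dist.items() if x <= K)
    if mass == 0:
        raise ValueError("conditioning event has probability zero")
    return sum(p * x for x, p in dist.items() if x <= K) / mass

quarter = Fraction(1, 4)
dist = {Fraction(-2): quarter, Fraction(0): quarter,
        Fraction(1): quarter, Fraction(5): quarter}

conditional_means = {K: conditional_mean_below(dist, Fraction(K)) for K in (-2, 0, 1, 5)}
```

Here $\mathbb{E} X = 1$, while the conditional means for $K = -2, 0, 1, 5$ are $-2$, $-1$, $-1/3$ and $1$ respectively.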

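Lemma 2.10. Let $X$, $Y$ be non-negative random variables. Assume there
exist $K, L > 0$ such that one has for every $t \ge 1$:

(2.2) $\mathbb{E}(X^2 \mid Y \le t) \le K^2 t, \qquad \mathbb{P}(Y > Lt) \le \frac{1}{t^2}$.

Then $\mathbb{E} X \le C K \sqrt{L}$.

Proof. Without loss of generality we can assume that $K = 1$ by rescaling $X$
to $X/K$. Thus we have for every $t \ge 1$:

(2.3) $\mathbb{E} X^2 \mathbf{1}_{\{Y \le t\}} \le \mathbb{E}(X^2 \mid Y \le t) \le t$.

We consider the decomposition

$\mathbb{E} X = \mathbb{E} X \mathbf{1}_{\{Y \le L\}} + \sum_{k=1}^\infty \mathbb{E} X \mathbf{1}_{\{2^{k-1}L < Y \le 2^k L\}}$.

By (2.3) and Hölder's inequality, the first term is bounded as

$\mathbb{E} X \mathbf{1}_{\{Y \le L\}} \le \big( \mathbb{E} X^2 \mathbf{1}_{\{Y \le L\}} \big)^{1/2} \le \sqrt{L}$.

Further terms can be estimated by the Cauchy-Schwarz inequality and using (2.3)
and the second inequality in (2.2). Indeed,

$\mathbb{E} X \mathbf{1}_{\{2^{k-1}L < Y \le 2^k L\}} = \mathbb{E} X \mathbf{1}_{\{Y \le 2^k L\}} \mathbf{1}_{\{Y > 2^{k-1}L\}}$

$\le \big( \mathbb{E} X^2 \mathbf{1}_{\{Y \le 2^k L\}} \big)^{1/2} \big( \mathbb{P}\{Y > 2^{k-1}L\} \big)^{1/2}$

$\le (2^k L)^{1/2} \cdot \frac{1}{2^{k-1}} = \sqrt{L}\, 2^{1-k/2}$.

Therefore

$\mathbb{E} X \le \sqrt{L} + \sum_{k=1}^\infty \sqrt{L}\, 2^{1-k/2} \le C\sqrt{L}$.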

This completes the proof. □

2.6. On the deterministic matrix $B$ in Theorem 1.1. We start with two
initial observations that will make our proof of Theorem 1.1 more transparent.
By adding an appropriate number of zero rows to $B$ or zero columns to $A$ we
can assume without loss of generality that $n = m$, thus $B$ is an $n \times N$ matrix.

Throughout the proof of Theorem 1.1, we shall denote the columns of such
a matrix $B$ by $B_1, \ldots, B_N$. They are non-random vectors in $\mathbb{R}^n$, which satisfy

(2.4) $\max_i \|B_i\|_2 \le \|B\| \le 1; \qquad \sum_{i=1}^N \|B_i\|_2^2 = \|B\|_{HS}^2 \le n\|B\|^2 \le n$

where $\|\cdot\|_{HS}$ denotes the Hilbert-Schmidt norm. Throughout the argument, we
will only have access to the matrix $B$ through inequalities (2.4). This explains
Remark 2 following Theorem 1.1, which states that the range space of $B$ is
irrelevant as long as we control the spectral and Hilbert-Schmidt norms of $B$.



3. Approach via M. Rudelson's theorem 

3.1. M. Rudelson's theorem. Our first approach, which will yield
Theorem 1.1 up to a logarithmic factor, rests on the following result. Here and
thereafter, by $\varepsilon_1, \varepsilon_2, \ldots$ we denote independent symmetric Bernoulli random
variables, i.e. independent random variables such that $\mathbb{P}(\varepsilon_i = 1) = \mathbb{P}(\varepsilon_i = -1) = 1/2$.

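Theorem 3.1 (M. Rudelson [22]). Let $u_1, \ldots, u_M$ be vectors in $\mathbb{R}^m$. Then, for
every $p \ge 1$, one has

$\Big( \mathbb{E} \big\| \sum_{i=1}^M \varepsilon_i u_i \otimes u_i \big\|^p \Big)^{1/p} \le C(\sqrt{p} + \sqrt{\log m}) \cdot \max_i \|u_i\|_2 \cdot \big\| \sum_{i=1}^M u_i \otimes u_i \big\|^{1/2}$.

In particular, for every $t > 0$, with probability at least $1 - 2m e^{-ct^2}$ one has

$\big\| \sum_{i=1}^M \varepsilon_i u_i \otimes u_i \big\| \le t \cdot \max_i \|u_i\|_2 \cdot \big\| \sum_{i=1}^M u_i \otimes u_i \big\|^{1/2}$.

The first estimate is taken from [22, inequality (3.4)]. The second estimate
can be easily derived from it using the following elementary lemma: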

Lemma 3.2 (Moments and tails). Suppose a non-negative random variable $X$
satisfies for some $m \ge 1$ that

$(\mathbb{E} X^p)^{1/p} \le \sqrt{p} + \sqrt{\log m}$ for every $p \ge 1$.

Then

$\mathbb{P}(X > t) \le 2m e^{-ct^2}$ for every $t > 0$.

Proof. Suppose first that $t \ge \max(1, \sqrt{\log m})$. Let $p := t^2$. Then
$\sqrt{p} \ge \sqrt{\log m}$, so the hypothesis gives $(\mathbb{E} X^p)^{1/p} \le 2\sqrt{p}$. By Markov's inequality,

$\mathbb{P}(X > 2et) = \mathbb{P}(X^p > (2et)^p) \le \frac{\mathbb{E} X^p}{(2et)^p} \le \Big( \frac{2\sqrt{p}}{2et} \Big)^p = e^{-t^2}$.

Therefore, for every $t > 0$ one has

(3.1) $\mathbb{P}(X > 2et) \le 2m e^{-t^2/2}$

because if $t < \max(1, \sqrt{\log m})$ then the right hand side of (3.1) is larger than
one, which makes the inequality trivial. Replacing $t$ by $t/(2e)$ in (3.1) gives the
conclusion with $c = 1/(8e^2)$. This completes the proof. □

The next lemma is a consequence of M. Rudelson's Theorem 3.1 and a
standard symmetrization argument.

Lemma 3.3. Let $X_1, \ldots, X_n$ be independent random vectors in $\mathbb{R}^m$ such that

(3.2) $\|\mathbb{E} X_j \otimes X_j\| \le 1$ for every $j$.

Then

$\mathbb{E} \big\| \sum_{j=1}^n X_j \otimes X_j \big\| \le Cn + C \log(2m) \cdot \mathbb{E} \max_j \|X_j\|_2^2$.

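Proof. Let $\varepsilon_1, \ldots, \varepsilon_n$ be independent symmetric Bernoulli random variables.
By the triangle inequality, the standard symmetrization argument (see e.g.
[19, Lemma 6.3]), and the assumption, we have

$E := \mathbb{E} \big\| \sum_{j=1}^n X_j \otimes X_j \big\| \le \mathbb{E} \big\| \sum_{j=1}^n (X_j \otimes X_j - \mathbb{E} X_j \otimes X_j) \big\| + \big\| \sum_{j=1}^n \mathbb{E} X_j \otimes X_j \big\|$

$\le 2\, \mathbb{E} \big\| \sum_{j=1}^n \varepsilon_j X_j \otimes X_j \big\| + n$.

Condition on the random variables $X_1, \ldots, X_n$, and apply Theorem 3.1. Writing
$\mathbb{E}_\varepsilon$ to denote the conditional expectation (i.e. the expectation with respect
to the random variables $\varepsilon_1, \ldots, \varepsilon_n$), we have

$\mathbb{E}_\varepsilon \big\| \sum_{j=1}^n \varepsilon_j X_j \otimes X_j \big\| \le C\sqrt{\log(2m)} \cdot \max_j \|X_j\|_2 \cdot \big\| \sum_{j=1}^n X_j \otimes X_j \big\|^{1/2}$.

Now we take expectation with respect to $X_1, \ldots, X_n$ and use the Cauchy-Schwarz
inequality to get

$E \le C\sqrt{\log(2m)} \cdot \big( \mathbb{E} \max_j \|X_j\|_2^2 \big)^{1/2} \cdot E^{1/2} + n$.

The conclusion of the lemma follows.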



□

3.2. Theorem 1.1 up to a logarithmic term. We now state a version of
Theorem 1.1 with a logarithmic factor.

Proposition 3.4. Let $N$, $n$ be positive integers. Consider an $N \times n$ random
matrix $A$ whose entries are independent random variables with mean zero and
4-th moment bounded by 1. Let $B$ be an $n \times N$ matrix such that $\|B\| \le 1$.
Then

$\mathbb{E}\|BA\| \le C\sqrt{n \log(2n)}$.

The proof will need two auxiliary lemmas. Recall that $B_1, \ldots, B_N$ denote
the columns of the matrix $B$.

Lemma 3.5. Let ai, . . . , ajv be independent random variables with mean zero 
and 4-th moment bounded by 1. Consider the random vector X in W 1 defined 
as 

N 



X = Y,^Bi- 



i=i 

Then 

E||X||^<n, Var(||X||l) < 3n. 
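Proof. The estimate on the expectation follows easily from (2.4):

$$\mathbb{E}\|X\|_2^2 = \sum_{i=1}^N \mathbb{E}(a_i^2) \|B_i\|_2^2 \le \sum_{i=1}^N \|B_i\|_2^2 \le n. \tag{3.3}$$

To estimate the variance, we need to compute

$$\mathbb{E}\|X\|_2^4 = \mathbb{E}\langle X, X\rangle^2 = \sum_{i,j,k,l=1}^N \mathbb{E}(a_i a_j a_k a_l) \langle B_i, B_j\rangle \langle B_k, B_l\rangle.$$

By independence and the mean zero assumption, the only nonzero terms in
this sum are those for which $i = j, k = l$ or $i = k, j = l$ or $i = l, j = k$.
Counting the diagonal terms $i = j = k = l$ in each of the three cases can only
increase the sum, therefore

$$\mathbb{E}\|X\|_2^4 \le \sum_{i,j=1}^N \mathbb{E}(a_i^2 a_j^2) \|B_i\|_2^2 \|B_j\|_2^2 + 2 \sum_{i,j=1}^N \mathbb{E}(a_i^2 a_j^2) \langle B_i, B_j\rangle^2$$

$$= \sum_{i=1}^N \mathbb{E}(a_i^4) \|B_i\|_2^4 + \sum_{i \ne j} \mathbb{E}(a_i^2) \mathbb{E}(a_j^2) \|B_i\|_2^2 \|B_j\|_2^2 + 2 \sum_{i,j=1}^N \mathbb{E}(a_i^2 a_j^2) \langle B_i, B_j\rangle^2$$

$$=: I_1 + I_2 + I_3.$$

By the fourth moment assumption and using (2.4) we have

$$I_1 \le \sum_{i=1}^N \|B_i\|_2^4 \le \max_i \big(\|B_i\|_2^2\big) \sum_{i=1}^N \|B_i\|_2^2 \le n.$$

Squaring the sum in (3.3), we see that

$$I_2 \le (\mathbb{E}\|X\|_2^2)^2.$$

Finally, since by the Cauchy-Schwarz inequality
$\mathbb{E}(a_i^2 a_j^2) \le \sqrt{\mathbb{E}(a_i^4) \mathbb{E}(a_j^4)} \le 1$, and
using (2.4) again, we obtain

$$I_3 \le 2 \sum_{i,j=1}^N \langle B_i, B_j\rangle^2 = 2 \|B^* B\|_{HS}^2 \le 2 \|B^*\|^2 \|B\|_{HS}^2 = 2 \|B\|^2 \|B\|_{HS}^2 \le 2n.$$

Putting all this together, we obtain

$$\operatorname{Var}(\|X\|_2^2) = \mathbb{E}\|X\|_2^4 - (\mathbb{E}\|X\|_2^2)^2 \le I_1 + I_3 \le 3n.$$

This completes the proof. □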
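As a small sanity check of Lemma 3.5, the following sketch computes the mean and variance of $\|X\|_2^2$ exactly by enumerating all sign patterns. It assumes Rademacher signs $a_i = \pm 1$ (one admissible choice, not the general case) and a hypothetical $2 \times 4$ matrix $B$ with orthonormal rows, so that $\|B\| = 1$; the function name `column_norm_moments` is ours.

```python
from itertools import product

def column_norm_moments(B):
    """Exact mean and variance of ||X||_2^2 for X = sum_i a_i B_i,
    where a_i are independent Rademacher signs and B_i are the columns of B."""
    n, N = len(B), len(B[0])
    values = []
    for signs in product((-1.0, 1.0), repeat=N):
        # Coordinates of X for this sign pattern.
        x = [sum(signs[i] * B[r][i] for i in range(N)) for r in range(n)]
        values.append(sum(c * c for c in x))
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

# Hypothetical example: rows are orthonormal, hence ||B|| = 1 and ||B||_HS^2 = n = 2.
B = [[0.5, 0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5, -0.5]]
mean, var = column_norm_moments(B)
print("E||X||^2 =", mean, " Var(||X||^2) =", var, " bounds: n =", len(B), ", 3n =", 3 * len(B))
```

In this example the expectation bound is attained, while the variance stays below $3n$.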

Lemma 3.6. Let $A$ and $B$ be matrices as in Proposition 3.4. Let
$X_1, \ldots, X_n \in \mathbb{R}^n$ denote the columns of the matrix $BA$. Then

$$\mathbb{E} \max_{j=1,\ldots,n} \|X_j\|_2^2 \le Cn.$$

Remark. This result says that all columns of the matrix $BA$ have norm $O(\sqrt{n})$
with high probability. Since the spectral norm of a matrix is bounded below by
the norm of any column, this result is a necessary step in proving our desired
estimate $\|BA\| = O(\sqrt{n})$.

Proof. Let, as usual, $B_1, \ldots, B_N \in \mathbb{R}^n$ denote the columns of the matrix $B$,
and let $a_{ij}$ denote the entries of the matrix $A$. Then

$$X_j = \sum_{i=1}^N a_{ij} B_i, \qquad j = 1, \ldots, n. \tag{3.4}$$

Let us fix $j \in \{1, \ldots, n\}$ and use Lemma 3.5. This gives

$$\mathbb{E}\|X_j\|_2^2 \le n, \qquad \operatorname{Var}(\|X_j\|_2^2) \le 3n. \tag{3.5}$$

Now we use Chebychev's inequality, which states that for a random variable
$Z$ with $\sigma^2 = \operatorname{Var}(Z)$ and for an arbitrary $k > 0$, one has

$$\mathbb{P}(|Z - \mathbb{E}Z| > k\sigma) \le \frac{1}{k^2}.$$

Let $t > 0$ be arbitrary. Using Chebychev's inequality along with (3.5) for
$Z = \|X_j\|_2^2$, $k = t\sqrt{n}$, we obtain

$$\mathbb{P}\big(\|X_j\|_2^2 > (1 + \sqrt{3}\, t) n\big) \le \frac{1}{t^2 n}.$$

Taking the union bound over all $j = 1, \ldots, n$, we conclude that

$$\mathbb{P}\Big( \max_{j=1,\ldots,n} \|X_j\|_2^2 > (1 + \sqrt{3}\, t) n \Big) \le n \cdot \frac{1}{t^2 n} = \frac{1}{t^2}.$$

Integration completes the proof.

Proof of Proposition 3.4. Let $X_1, \ldots, X_n \in \mathbb{R}^n$ denote the columns of the
matrix $BA$. We are going to apply Lemma 3.3. In order to check that condition
(3.2) holds, we consider an arbitrary vector $x \in S^{n-1}$ and use representation
(3.4) to compute

$$\mathbb{E}\langle X_j, x\rangle^2 = \mathbb{E}\Big( \sum_{i=1}^N a_{ij} \langle B_i, x\rangle \Big)^2 = \sum_{i=1}^N \mathbb{E}(a_{ij}^2) \langle B_i, x\rangle^2 \le \sum_{i=1}^N \langle B_i, x\rangle^2$$

$$= \|B^* x\|_2^2 \le \|B^*\|^2 = \|B\|^2 \le 1.$$

This shows that condition (3.2) holds. Lemma 3.3 then gives

$$\mathbb{E}\|BA\|^2 = \mathbb{E}\Big\| \sum_{j=1}^n X_j \otimes X_j \Big\| \le Cn + C \log(2n)\, \mathbb{E} \max_{j=1,\ldots,n} \|X_j\|_2^2.$$

Estimating the maximum in the right hand side using Lemma 3.6, we conclude
that

$$\mathbb{E}\|BA\|^2 \le C_1 n \log(2n).$$

This completes the proof. □
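The "integration" step that closes the proof of Lemma 3.6 can be spelled out as follows. This is a sketch with explicit but non-optimized constants; the shorthand $Z := \max_{j} \|X_j\|_2^2$ is introduced here only for convenience. It uses nothing beyond the tail bound $\mathbb{P}(Z > (1 + \sqrt{3}\,t)n) \le 1/t^2$ and the trivial bound by 1.

```latex
\begin{align*}
\mathbb{E} Z
  &= \int_0^\infty \mathbb{P}(Z > u)\, du
   \le n + \int_n^\infty \mathbb{P}(Z > u)\, du \\
  &= n + \sqrt{3}\, n \int_0^\infty \mathbb{P}\big(Z > (1 + \sqrt{3}\, t)\, n\big)\, dt
   \qquad \text{(substituting } u = (1 + \sqrt{3}\, t)\, n \text{)} \\
  &\le n + \sqrt{3}\, n \int_0^\infty \min(1, t^{-2})\, dt
   = n + 2\sqrt{3}\, n \le 5n.
\end{align*}
```

The same pattern (a tail bound integrated against the trivial bound for small deviations) is what the later proofs abbreviate as "integration".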

3.3. Tradeoff between the matrix norm and the magnitude of entries.

We would like now to gain more control over the logarithmic factor than we
have in Proposition 3.4. Our next result establishes a tradeoff between the
logarithmic factor and the magnitude of the matrices $A$, $B$. It will be used in
the proof of Theorem 3.9.

Proposition 3.7. Let $a, b > 0$ and $N, n$ be positive integers. Let $A$ be an
$N \times n$ matrix whose entries are independent random variables $a_{ij}$ with mean
zero and such that

$$\mathbb{E} a_{ij}^2 \le 1, \qquad |a_{ij}| \le a \qquad \text{for every } i, j.$$

Let $B$ be an $n \times N$ matrix such that $\|B\| \le 1$, and whose columns satisfy

$$\|B_i\|_2 \le b \qquad \text{for every } i.$$

Then

$$\mathbb{E}\|BA\| \le C \big(1 + a b^{1/2} \log^{1/4}(2n)\big) \sqrt{n}.$$



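The proof will again be based on M. Rudelson's Theorem 3.1, although this
time we use Rudelson's theorem in a more delicate way:

Lemma 3.8. Under the assumptions of Proposition 3.7, we have

$$\mathbb{E} \max_{j=1,\ldots,n} \Big\| \sum_{i=1}^N a_{ij}^2 B_i \otimes B_i \Big\| \le C \big(1 + a^2 b \sqrt{\log(2n)}\big).$$

Proof. Fix $j \in \{1, \ldots, n\}$. Let $\mu_{ij}^2 := \mathbb{E} a_{ij}^2$. By the triangle inequality,

$$\Big\| \sum_{i=1}^N a_{ij}^2 B_i \otimes B_i \Big\| \le \Big\| \sum_{i=1}^N (a_{ij}^2 - \mu_{ij}^2) B_i \otimes B_i \Big\| + \Big\| \sum_{i=1}^N \mu_{ij}^2 B_i \otimes B_i \Big\|. \tag{3.6}$$

Since $0 \le \mu_{ij}^2 \le 1$ and

$$\Big\| \sum_{i=1}^N B_i \otimes B_i \Big\| = \|B\|^2 \le 1, \tag{3.7}$$

we have

$$\Big\| \sum_{i=1}^N \mu_{ij}^2 B_i \otimes B_i \Big\| \le \Big\| \sum_{i=1}^N B_i \otimes B_i \Big\| \le 1. \tag{3.8}$$

Next, clearly $\mu_{ij}^2 \le a^2$, so

$$\mathbb{E}(a_{ij}^2 - \mu_{ij}^2) = 0, \qquad |a_{ij}^2 - \mu_{ij}^2| \le 2a^2.$$

Symmetrization Lemma 2.7 yields

$$\mathbb{E} \max_{j=1,\ldots,n} \Big\| \sum_{i=1}^N (a_{ij}^2 - \mu_{ij}^2) B_i \otimes B_i \Big\| \le 2a^2\, \mathbb{E} \max_{j=1,\ldots,n} \Big\| \sum_{i=1}^N \varepsilon_{ij} B_i \otimes B_i \Big\| \tag{3.9}$$

where $\varepsilon_{ij}$ denote independent symmetric Bernoulli random variables.

Let $t > 0$. By the second part of M. Rudelson's Theorem 3.1 and taking the
union bound over $n$ random variables, we conclude that, with probability at
least $1 - 2n^2 e^{-ct^2}$, we have

$$\max_{j=1,\ldots,n} \Big\| \sum_{i=1}^N \varepsilon_{ij} B_i \otimes B_i \Big\| \le t \cdot \max_{i=1,\ldots,N} \|B_i\|_2 \cdot \Big\| \sum_{i=1}^N B_i \otimes B_i \Big\|^{1/2} \le tb.$$

The second estimate follows from (3.7) and since $\max_i \|B_i\|_2 \le b$ by the
hypothesis.

Let $s > 0$ be arbitrary. We apply the above estimate for $t$ chosen so that
$2n^2 e^{-ct^2} = e^{-s}$. This shows that, with probability at least $1 - e^{-s}$, one has

$$\max_{j=1,\ldots,n} \Big\| \sum_{i=1}^N \varepsilon_{ij} B_i \otimes B_i \Big\| \le tb \le C_1 b \sqrt{\log(2n) + s}.$$

Integration implies that

$$\mathbb{E} \max_{j=1,\ldots,n} \Big\| \sum_{i=1}^N \varepsilon_{ij} B_i \otimes B_i \Big\| \le C_2 b \sqrt{\log(2n)}.$$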

Putting this into (3.9) and, together with (3.8), back into (3.6), we complete
the proof. □

Proof of Proposition 3.7. By the symmetrization argument (see (2.1)), we can
assume that the entries of the matrix $A$ are $g_{ij} a_{ij}$, where $a_{ij}$ are random
variables satisfying the assumptions of the proposition, and $g_{ij}$ are independent
standard normal random variables. We will write $\mathbb{E}_g$, $\mathbb{P}_g$ when we take
expectations and probability estimates with respect to $(g_{ij})$ (i.e. conditioned on
$(a_{ij})$), and we write $\mathbb{E}_a$ to denote the expectation with respect to $(a_{ij})$.

By Lemma 3.8, the random variable

$$K^2 := \max_{j=1,\ldots,n} \Big\| \sum_{i=1}^N a_{ij}^2 B_i \otimes B_i \Big\|,$$

which does not depend on the random variables $(g_{ij})$, has expectation

$$\mathbb{E}_a(K^2) \le C \big(1 + a^2 b \sqrt{\log(2n)}\big). \tag{3.10}$$

We condition on the random variables $(a_{ij})$; this fixes a value of $K$.
Let $X_1, \ldots, X_n \in \mathbb{R}^n$ denote the columns of the matrix $BA$; then

$$X_j = \sum_{i=1}^N g_{ij} a_{ij} B_i, \qquad j = 1, \ldots, n.$$

Consider a $(1/2)$-net $\mathcal{N}$ of the unit Euclidean sphere $S^{n-1}$ of cardinality
$|\mathcal{N}| \le 5^n$, which exists by Lemma 2.5. Using Proposition 2.6, we have

$$\|BA\|^2 = \|(BA)^*\|^2 \le 4 \max_{x \in \mathcal{N}} \|(BA)^* x\|_2^2 = 4 \max_{x \in \mathcal{N}} \sum_{j=1}^n \langle X_j, x\rangle^2. \tag{3.11}$$
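Fix $x \in \mathcal{N}$. For every $j = 1, \ldots, n$, the random variable

$$\langle X_j, x\rangle = \sum_{i=1}^N g_{ij} \langle a_{ij} B_i, x\rangle$$

is a Gaussian random variable with mean zero and variance

$$\sum_{i=1}^N \langle a_{ij} B_i, x\rangle^2 \le \Big\| \sum_{i=1}^N a_{ij}^2 B_i \otimes B_i \Big\| \le K^2.$$

(To obtain the first inequality, take the supremum over $x \in S^{n-1}$.) Therefore,
by Corollary 2.2 with $d_j = (\operatorname{Var}\langle X_j, x\rangle)^{1/2} \le K$, we have for every $t > 0$:

$$\mathbb{P}_g \Big\{ \Big( \sum_{j=1}^n \langle X_j, x\rangle^2 \Big)^{1/2} > K\sqrt{n} + t \Big\} \le e^{-c_0 t^2 / K^2}.$$

Let $s > 0$ be arbitrary. The previous estimate for $t = sK\sqrt{n}$ gives

$$\mathbb{P}_g \Big\{ \Big( \sum_{j=1}^n \langle X_j, x\rangle^2 \Big)^{1/2} > (1 + s) K \sqrt{n} \Big\} \le e^{-c_0 s^2 n}.$$

Taking the union bound over $x \in \mathcal{N}$ and using (3.11), we obtain

$$\mathbb{P}_g \big\{ \|BA\| > 2(1 + s) K \sqrt{n} \big\} \le |\mathcal{N}|\, e^{-c_0 s^2 n} = 5^n e^{-c_0 s^2 n} \le e^{(2 - c_0 s^2) n}.$$

Integration yields

$$\mathbb{E}_g \|BA\| \le C K \sqrt{n}.$$

Finally, we take expectation with respect to the random variables $(a_{ij})$ and
use (3.10) to conclude that

$$\mathbb{E}\|BA\| \le C\, \mathbb{E}_a(K) \sqrt{n} \le C_1 \big(1 + a^2 b \sqrt{\log(2n)}\big)^{1/2} \sqrt{n}.$$

This completes the proof. □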

3.4. Theorem 1.1 for logarithmically small columns. Our next step is
to combine Propositions 3.4 and 3.7 and obtain a weaker version of the main
Theorem 1.1 - this time with the correct bound $O(\sqrt{n})$ on the norm, but under
the additional assumption that the columns of the matrix $B$ are logarithmically
small.

Theorem 3.9. Let $\varepsilon \in (0, 1)$ and let $N, n$ be positive integers. Consider an
$N \times n$ random matrix $A$ whose entries are independent random variables with
mean zero and $(4 + \varepsilon)$-th moment bounded by 1. Let $B$ be an $n \times N$ matrix
such that $\|B\| \le 1$, and whose columns satisfy for some $M \ge 1$ that

$$\|B_i\|_2 \le M \log^{-\frac{1}{2} - \frac{4}{\varepsilon}}(2n) \qquad \text{for every } i.$$

Then

$$\mathbb{E}\|BA\| \le C M^{1/2} \sqrt{n}.$$

Proof. By the symmetrization argument described in Section 2, we can assume
without loss of generality that all entries $a_{ij}$ of the matrix $A = (a_{ij})$ are
symmetric random variables. Let

$$a := \log^{2/\varepsilon}(2n).$$

We decompose every entry of the matrix $A$ according to its absolute value as

$$\bar{a}_{ij} := a_{ij} \mathbf{1}_{\{|a_{ij}| \le a\}}, \qquad \tilde{a}_{ij} := a_{ij} \mathbf{1}_{\{|a_{ij}| > a\}}.$$

Then all random variables $\bar{a}_{ij}$ and $\tilde{a}_{ij}$ have mean zero, and we have the
following decomposition of matrices:

$$BA = B\bar{A} + B\tilde{A}, \qquad \text{where } \bar{A} = (\bar{a}_{ij}),\ \tilde{A} = (\tilde{a}_{ij}).$$

The norm of $B\tilde{A}$ can be bounded using Proposition 3.4. Indeed, by the
Truncation Lemma 2.8 with $p = 1 + \varepsilon/4$, we have

$$\mathbb{E}\tilde{a}_{ij}^4 = \mathbb{E} a_{ij}^4 \mathbf{1}_{\{a_{ij}^4 > a^4\}} \le \frac{\mathbb{E}|a_{ij}|^{4+\varepsilon}}{a^{\varepsilon}} \le a^{-\varepsilon},$$

where the last inequality follows from the moment hypothesis. Therefore, the
matrix $a^{\varepsilon/4} \tilde{A}$ satisfies the hypothesis of Proposition 3.4, which then yields

$$\mathbb{E}\|B\tilde{A}\| \le C a^{-\varepsilon/4} \sqrt{n \log(2n)} = C\sqrt{n}.$$

The norm of $B\bar{A}$ can be bounded using Proposition 3.7, which we can apply
with $a$ as above and $b = M \log^{-\frac{1}{2} - \frac{4}{\varepsilon}}(2n)$. This gives

$$\mathbb{E}\|B\bar{A}\| \le C \big(1 + a b^{1/2} \log^{1/4}(2n)\big) \sqrt{n} \le 2C M^{1/2} \sqrt{n},$$

where the last inequality follows by our choice of $a$ and $b$.

Putting the two estimates together, we conclude by the triangle inequality
that

$$\mathbb{E}\|BA\| \le \mathbb{E}\|B\bar{A}\| + \mathbb{E}\|B\tilde{A}\| \le C' M^{1/2} \sqrt{n}.$$

This completes the proof. □
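The exponent bookkeeping behind the two "by our choice of $a$ and $b$" steps can be checked directly. The sketch below assumes the parameters are read as $a = \log^{2/\varepsilon}(2n)$ and $b = M \log^{-1/2 - 4/\varepsilon}(2n)$; under that reading both logarithmic factors cancel exactly.

```latex
\begin{align*}
a^{-\varepsilon/4} \sqrt{\log(2n)}
  &= \log^{-\frac{2}{\varepsilon} \cdot \frac{\varepsilon}{4} + \frac{1}{2}}(2n) = 1, \\
a\, b^{1/2} \log^{1/4}(2n)
  &= M^{1/2} \log^{\frac{2}{\varepsilon} - \frac{1}{2}\left(\frac{1}{2} + \frac{4}{\varepsilon}\right) + \frac{1}{4}}(2n)
   = M^{1/2}.
\end{align*}
```

The first line is what turns the bound $\sqrt{n \log(2n)}$ of Proposition 3.4 into $\sqrt{n}$ for the large entries; the second, together with $M \ge 1$, gives $1 + M^{1/2} \le 2M^{1/2}$ for the bounded entries.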

Remark. The factor $M^{1/2}$ in the conclusion of Theorem 3.9 can easily be
improved to about $M^{\varepsilon/2}$ by choosing $a = t \log^{2/\varepsilon}(2n)$ in the proof and
optimizing in $t$. We will not need this improvement in our argument.

4. Approach via concentration 

In this section, we develop an alternative way to bound the norm of $BA$,
which rests on Gaussian concentration inequalities and an elaborate choice of
$\varepsilon$-nets. The main technical result of this section is the following theorem,
which, like Theorem 3.9, gives the correct bound $O(\sqrt{n})$ under some boundedness
assumptions on the entries of $A$.

Theorem 4.1. Let $\varepsilon \in (0, 1)$, $M \ge 1$ and let $N \ge n$ be positive integers such
that $\log(2N) \le Mn$. Consider an $N \times n$ random matrix $A$ whose entries are
independent random variables with mean zero and such that

$$\mathbb{E}|a_{ij}|^{2+\varepsilon} \le 1, \qquad |a_{ij}| \le \Big( \frac{Mn}{\log(2N)} \Big)^{\frac{1}{2+\varepsilon}} \qquad \text{for every } i, j.$$

Let $B$ be an $n \times N$ matrix such that $\|B\| \le 1$. Then

$$\mathbb{E}\|BA\| \le C(\varepsilon) \sqrt{Mn}$$

where $C(\varepsilon)$ depends only on $\varepsilon$.

Remarks. 1. If the entries $a_{ij}$ have bounded $(4 + \varepsilon)$-th moment, it is easy to
check that $\max_{i,j} |a_{ij}| \lesssim (nN)^{\frac{1}{4+\varepsilon}}$ holds with high probability. Therefore, under
the $(4 + \varepsilon)$-th moment assumption, the hypotheses of Theorem 4.1 are satisfied
for almost square matrices, i.e. those for which $N \le n^{1 + c\varepsilon}$. This will quickly
yield the main Theorem 1.1 for almost square matrices, see Corollary 4.11
below.

2. The hypotheses of Theorem 4.1 are almost sharp when $N \sim n$. Indeed,
let us assume for simplicity that the random variables $a_{ij}$ are identically
distributed and $B$ is the identity matrix. The $(2 + \varepsilon)$-th moment hypothesis is
almost sharp: if $\mathbb{E} a_{ij}^2 \gg 1$ then
$(\mathbb{E}\|A\|^2)^{1/2} \ge \big( \frac{1}{n} \mathbb{E}\|A\|_{HS}^2 \big)^{1/2} \gg \sqrt{n}$. Also, the
boundedness hypothesis is almost sharp, since $\|A\| \ge \max_{i,j} |a_{ij}|$.

3. Using M. Talagrand's concentration result, Theorem 2.4, one can also
obtain tail bounds for the norm $\|BA\|$:



Corollary 4.2. Under the assumptions of Theorem 4.1, one has for every
$t > 0$:

$$\mathbb{P}\big( \|BA\| > (C(\varepsilon) + t) \sqrt{Mn} \big) \le 4 e^{-t^2/4}.$$

In particular, one has for every $q \ge 1$:

$$(\mathbb{E}\|BA\|^q)^{1/q} \le C_0(\varepsilon) \sqrt{qMn}.$$

Proof. We can consider the $N \times n$ matrix $A$ as a vector in $\mathbb{R}^{Nn}$. The Euclidean
norm of such a vector equals the Hilbert-Schmidt norm $\|A\|_{HS}$. Since
$\|BA\| \le \|B\| \|A\| \le 1 \cdot \|A\|_{HS}$, the function $f : \mathbb{R}^{Nn} \to \mathbb{R}$ defined by
$f(A) = \|BA\|$ is 1-Lipschitz and convex. Since we have $|a_{ij}| \le \sqrt{Mn}$ for all
$i, j$ by the assumptions, M. Talagrand's Theorem 2.4 gives

$$\mathbb{P}\big( \|BA\| - \mathbb{E}\|BA\| > t\sqrt{Mn} \big) \le 4 e^{-t^2/4}, \qquad t > 0.$$

The estimate for $\mathbb{E}\|BA\|$ in Theorem 4.1 completes the proof. □

4.1. Sparse matrices: rows and columns. Theorem 4.1 will follow from
our analysis of sparse matrices. We will decompose the entries $a_{ij}$ according
to their magnitude. As the magnitude increases, the moment assumptions will
ensure that there will be fewer such entries, i.e. the resulting matrix becomes
sparser.

We start with an elementary lemma, which will help us analyze the
magnitude of the rows and columns of the matrix $BA$ when $A$ is a sparse matrix.

Lemma 4.3. Let $N, n$ be positive integers. Consider independent random
variables $a_{ij}$, $i = 1, \ldots, N$, $j = 1, \ldots, n$. Let $p \in (0, 1]$, and suppose that

$$\mathbb{E} a_{ij}^2 \le p, \qquad |a_{ij}| \le 1 \qquad \text{for every } i, j.$$

Let $B$ be an $n \times N$ matrix such that $\|B\| \le 1$, whose columns are denoted $B_i$.
Then

$$\mathbb{E} \max_{i=1,\ldots,N} \sum_{j=1}^n a_{ij}^2 \le C(np + \log(2N)), \tag{4.1}$$

$$\mathbb{E} \max_{j=1,\ldots,n} \sum_{i=1}^N a_{ij}^2 \|B_i\|_2^2 \le C(np + \log(2n)). \tag{4.2}$$

Remark. The test case for this lemma, as well as for most of the results that
follow, is the random variables $a_{ij}$ with values in $\{-1, 0, 1\}$ and such that
$\mathbb{P}(a_{ij} \ne 0) = p$. The $N \times n$ random matrix $A = (a_{ij})$ will then become sparser
as we decrease $p$; it will have on average $np$ nonzero entries per row. Estimate
(4.1) gives a bound on the Euclidean norm of all rows of $A$.
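The test case from the remark is easy to simulate. The sketch below draws a sparse sign matrix (entries $\pm 1$ with probability $p/2$ each and $0$ otherwise, which is one way to realize the test case) and compares the largest squared row norm with the scale $np + \log(2N)$ appearing in the first estimate of Lemma 4.3. The function names and the parameter values are illustrative assumptions, and a single random draw illustrates the estimate without proving it.

```python
import math
import random

def sparse_sign_matrix(N, n, p, rng):
    """N x n matrix with independent entries: +1 or -1 with probability p/2 each, else 0."""
    return [[rng.choice((-1, 1)) if rng.random() < p else 0 for _ in range(n)]
            for _ in range(N)]

def max_row_sq_norm(A):
    """max_i sum_j a_ij^2, the quantity bounded in expectation by the row estimate."""
    return max(sum(a * a for a in row) for row in A)

# Hypothetical sizes; the seed only makes the run reproducible.
N, n, p = 400, 200, 0.05
A = sparse_sign_matrix(N, n, p, random.Random(0))
print("max_i sum_j a_ij^2 :", max_row_sq_norm(A))
print("np + log(2N)       :", n * p + math.log(2 * N))
```

Decreasing `p` makes the $\log(2N)$ term dominate, which is the regime where the maximum over rows is governed by rare rows with unusually many nonzero entries.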

Proof. We will only prove inequality (4.2); the proof of inequality (4.1) is
similar. By the assumptions, we have

$$\operatorname{Var}(a_{ij}^2) \le \mathbb{E} a_{ij}^4 \le \mathbb{E} a_{ij}^2 \le p \qquad \text{for every } i, j.$$

Also, recall that (2.4) gives

$$\sum_{i=1}^N \|B_i\|_2^2 \le n, \qquad \sum_{i=1}^N \|B_i\|_2^4 \le \max_i \|B_i\|_2^2 \cdot \sum_{i=1}^N \|B_i\|_2^2 \le n.$$

Consider the sums of independent random variables

$$S_j := \sum_{i=1}^N a_{ij}^2 \|B_i\|_2^2, \qquad j = 1, \ldots, n.$$

The above estimates show that for every $j$ we have

$$\mathbb{E} S_j = \sum_{i=1}^N \mathbb{E}(a_{ij}^2) \|B_i\|_2^2 \le np, \qquad \operatorname{Var}(S_j) = \sum_{i=1}^N \operatorname{Var}(a_{ij}^2) \|B_i\|_2^4 \le np.$$

We apply Bennett's inequality, Theorem 2.3, for $X_i = (a_{ij}^2 - \mathbb{E} a_{ij}^2) \|B_i\|_2^2$,
which clearly satisfy $|X_i| \le 1$ because $|a_{ij}| \le 1$ and $\|B_i\|_2 \le 1$ by (2.4). We
obtain

$$\mathbb{P}\{ S_j - \mathbb{E} S_j > t \} \le \exp\big( -\sigma^2 h(t/\sigma^2) \big) \tag{4.3}$$

where $\mathbb{E}(S_j) \le np$ and $\sigma^2 = \operatorname{Var}(S_j) \le np$. Note that $h(x) \ge cx$ for
$x \ge 1$, where $c$ is some positive absolute constant. Therefore, if $t \ge np$, then
$\sigma^2 h(t/\sigma^2) \ge ct$, so (4.3) yields

$$\mathbb{P}\{ S_j > 2t \} \le e^{-ct} \qquad \text{for } t \ge np.$$

Taking the union bound over all $j$, we conclude that

$$\mathbb{P}\Big\{ \max_{j=1,\ldots,n} S_j > 2t \Big\} \le n e^{-ct} \qquad \text{for } t \ge np.$$

Now let $s \ge 1$ be arbitrary, and use the last inequality for $t = (np + \log(2n)) s$.
We obtain

$$\mathbb{P}\Big\{ \max_{j=1,\ldots,n} S_j > 2(np + \log(2n)) s \Big\} \le n e^{-c \log(2n) s} = 2^{-cs} n^{1 - cs}.$$

Integration yields

$$\mathbb{E} \max_{j=1,\ldots,n} S_j \le C(np + \log(2n)).$$

This completes the proof of (4.2). □



The estimates in Lemma 4.3 motivate us to consider the class of $N \times n$
matrices $A = (a_{ij})$ whose entries satisfy the following inequalities for some
parameters $p \in (0, 1]$ and $K \ge 1$:

$$\max_{i,j} |a_{ij}| \le 1; \qquad \max_{i=1,\ldots,N} \Big( \sum_{j=1}^n a_{ij}^2 \Big)^{1/2} \le K \sqrt{np + \log(2N)}; \qquad \max_{j=1,\ldots,n} \Big( \sum_{i=1}^N a_{ij}^2 \|B_i\|_2^2 \Big)^{1/2} \le K \sqrt{np + \log(2n)}. \tag{4.4}$$

We have proved that for random matrices whose entries satisfy $|a_{ij}| \le 1$ and
$\mathbb{E} a_{ij}^2 \le p$, conditions (4.4) hold with a random parameter $K$ that satisfies
$\mathbb{E} K \le C$.



4.2. Concentration for a fixed vector. Our goal will be to estimate the
magnitude of $\|BA\|$ for matrices of the form $A = (g_{ij} a_{ij})$, where $g_{ij}$ are
independent standard normal random variables, and $a_{ij}$ are fixed numbers that
satisfy conditions (4.4). Such an estimate will be established in Proposition 4.8
below. By the standard symmetrization, the same estimate will hold true if
$A = (a_{ij})$ is a random matrix with entries as in Lemma 4.3. This will be done in
Corollary 4.9. Finally, Theorem 4.1 will be deduced from this by decomposing
the entries of a random matrix according to their magnitude.

Our first step toward this goal is to check the magnitude of $\|BAx\|_2$ for a
fixed vector $x$.

Lemma 4.4. Let $N, n$ be positive integers. Consider an $N \times n$ random matrix
$A = (g_{ij} a_{ij})$ where $g_{ij}$ are independent standard normal random variables and
$a_{ij}$ are numbers that satisfy conditions (4.4). Let $B$ be an $n \times N$ matrix such
that $\|B\| \le 1$. Then, for every vector $x \in B_2^n$, we have

$$\mathbb{E}\|BAx\|_2 \le K \sqrt{np + \log(2n)}.$$
Proof. Denoting as usual the columns of $B$ by $B_i$, we have

$$BAx = \sum_{i=1}^N \Big( \sum_{j=1}^n g_{ij} a_{ij} x_j \Big) B_i.$$

Since $\|x\|_2 \le 1$ and using the last condition in (4.4), we have

$$\mathbb{E}\|BAx\|_2^2 = \sum_{i=1}^N \sum_{j=1}^n a_{ij}^2 x_j^2 \|B_i\|_2^2 = \sum_{j=1}^n x_j^2 \sum_{i=1}^N a_{ij}^2 \|B_i\|_2^2$$

$$\le \max_{j=1,\ldots,n} \sum_{i=1}^N a_{ij}^2 \|B_i\|_2^2 \le K^2 (np + \log(2n)).$$

This completes the proof. □

We will now strengthen Lemma 4.4 into a deviation inequality for $\|BAx\|_2$.
This is a simple consequence of the Gaussian concentration, Theorem 2.1. This
deviation inequality is universal in that it holds for any vector $x$; in the sequel
we will need more delicate inequalities that depend on the distribution of the
coordinates in $x$.

Lemma 4.5 (Universal deviation). Let $A$ and $B$ be matrices as in Lemma 4.4.
Then, for every vector $x \in B_2^n$ and every $t > 0$, we have

$$\mathbb{P}\big\{ \|BAx\|_2 > K \sqrt{np + \log(2n)} + t \big\} \le e^{-c_0 t^2}. \tag{4.5}$$

Proof. As in the proof of Lemma 4.4, we write

$$BAx = \sum_{i=1}^N \Big( \sum_{j=1}^n g_{ij} a_{ij} x_j \Big) B_i$$

where $B_i$ are the columns of the matrix $B$. Therefore, the random vector $BAx$
is distributed identically with the random vector

$$\sum_{i=1}^N g_i \lambda_i B_i, \qquad \text{where } \lambda_i = \Big( \sum_{j=1}^n a_{ij}^2 x_j^2 \Big)^{1/2}$$

and where $g_i$ are independent standard normal random variables. Since all
$|a_{ij}| \le 1$ by conditions (4.4), and $\|x\|_2 \le 1$ by the assumptions, we have

$$0 \le \lambda_i \le 1, \qquad i = 1, \ldots, N.$$

Consider the map $f : \mathbb{R}^N \to \mathbb{R}$ given by

$$f(y) = \Big\| \sum_{i=1}^N y_i \lambda_i B_i \Big\|_2.$$

Its Lipschitz norm equals

$$\|f\|_{\mathrm{Lip}} = \Big\| \sum_{i=1}^N \lambda_i^2 B_i \otimes B_i \Big\|^{1/2} \le \max_i \lambda_i \cdot \Big\| \sum_{i=1}^N B_i \otimes B_i \Big\|^{1/2} \le 1 \cdot \|B\| \le 1.$$

Then the Gaussian concentration, Theorem 2.1, gives for every $t > 0$:

$$\mathbb{P}\big( f(g) - \mathbb{E} f(g) > t \big) \le \exp(-c_0 t^2),$$

where $g = (g_1, \ldots, g_N)$. Since, as we noted above, $f(g)$ is distributed identically
with $\|BAx\|_2$, Lemma 4.4 completes the proof. □

4.3. Control of sparse vectors. Since the spectral norm of $BA$ is the
supremum of $\|BAx\|_2$ over all $x \in S^{n-1}$, the result of Lemma 4.5 suggests that
$\mathbb{E}\|BA\| \lesssim \sqrt{np + \log N}$ should be true. However, the deviation inequality in
Lemma 4.5 is not strong enough to prove this bound. This is because the
metric entropy of the sphere, measured e.g. as the cardinality of its $\frac{1}{2}$-net, is $e^{cn}$.
If we are to make the bound on $\|BAx\|_2$ uniform over the net, we would need
the probability estimate in (4.5) to be at most $e^{-cn}$ (to leave room for the union
bound over $e^{cn}$ points $x$ in the net). This however would force us to make
$t \sim \sqrt{n}$ or larger, so the best bound we can get this way is $\mathbb{E}\|BA\| \lesssim \sqrt{n}$.
This bound is too weak as it ignores the last two assumptions in (4.4).

Nevertheless, the bound in Lemma 4.5 can be made uniform over a set of
sparse vectors, whose metric entropy is smaller than that of the whole sphere:

Proposition 4.6 (Sparse vectors). Let $A$ and $B$ be matrices as in Lemma 4.4.
There exists an absolute constant $c > 0$ such that the following holds. Consider
the set of vectors

$$B_{2,0} := \big\{ x \in \mathbb{R}^n : \|x\|_2 \le 1,\ \|x\|_0 \le cnp/\log(e/p) \big\}.$$

Then

$$\mathbb{E} \sup_{x \in B_{2,0}} \|BAx\|_2 \le 3K \sqrt{np + \log(2n)}.$$

Proof. Let $c > 0$ be a constant to be determined later, and let
$\lambda := cp/\log(e/p)$. Then

$$B_{2,0} = \bigcup_{|J| = \lfloor \lambda n \rfloor} B_2^J$$

where the union is over all subsets $J \subseteq \{1, \ldots, n\}$ of cardinality $\lfloor \lambda n \rfloor$, and
where $B_2^J = \{x \in \mathbb{R}^J : \|x\|_2 \le 1\}$ denotes the unit Euclidean ball in $\mathbb{R}^J$. By
Lemma 2.5, $B_2^J$ has a $\frac{1}{2}$-net $\mathcal{N}_J$ of cardinality at most $e^{2\lambda n}$. Let $t \ge 1$. For a
fixed $x \in \mathcal{N}_J$, Lemma 4.5 gives

$$\mathbb{P}\big\{ \|BAx\|_2 > (K + 1) \sqrt{np + \log(2n)} + t \big\} \le \exp\big( -c_0 (np + t^2) \big).$$

Using Proposition 2.6 and taking the union bound over all $x \in \mathcal{N}_J$, we obtain

$$\mathbb{P}\Big\{ \frac{1}{2} \sup_{x \in B_2^J} \|BAx\|_2 > (K + 1) \sqrt{np + \log(2n)} + t \Big\} \le \mathbb{P}\Big\{ \sup_{x \in \mathcal{N}_J} \|BAx\|_2 > (K + 1) \sqrt{np + \log(2n)} + t \Big\}$$

$$\le |\mathcal{N}_J| \exp\big( -c_0 (np + t^2) \big) \le \exp\big( 2\lambda n - c_0 (np + t^2) \big).$$

Since there are $\binom{n}{\lfloor \lambda n \rfloor} \le (e/\lambda)^{\lambda n}$ ways to choose the subset $J$, by taking the
union bound over all $J$ we conclude that

$$\mathbb{P}\Big\{ \frac{1}{2} \sup_{x \in B_{2,0}} \|BAx\|_2 > (K + 1) \sqrt{np + \log(2n)} + t \Big\} \le \exp\big( \lambda \log(e/\lambda) n + 2\lambda n - c_0 (np + t^2) \big). \tag{4.6}$$

Finally, if the absolute constant $c > 0$ in the definition of $\lambda$ is chosen sufficiently
small, we have $\lambda \log(e/\lambda) n + 2\lambda n \le c_0 np$. Thus the right hand side of (4.6) is
at most

$$\exp(-c_0 t^2).$$

Integration completes the proof. □
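The counting behind the union bound over supports and nets can be checked numerically. The sketch below compares the exact logarithm of (number of supports of size $k$) times (net size $5^k$ per support, the cardinality bound for a $\frac{1}{2}$-net quoted earlier) with the bound $\lambda \log(e/\lambda) n + 2\lambda n$ for $\lambda = k/n$. The helper names and the value of `n` are illustrative assumptions.

```python
import math

def log_count_exact(n, k):
    """log of C(n, k) * 5**k: supports of size k, times a net of size 5**k on each."""
    return math.log(math.comb(n, k)) + k * math.log(5)

def log_count_bound(n, k):
    """The bound lam * log(e / lam) * n + 2 * lam * n with lam = k / n."""
    lam = k / n
    return lam * math.log(math.e / lam) * n + 2 * lam * n

n = 200  # hypothetical dimension
worst_gap = min(log_count_bound(n, k) - log_count_exact(n, k) for k in range(1, n + 1))
print("smallest slack of the bound over all support sizes:", worst_gap)
```

The slack is positive for every support size because $\binom{n}{k} \le (en/k)^k$ and $5 \le e^2$; the proof then only needs this total to stay below $c_0 np$, which is what the choice of the constant $c$ ensures.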

4.4. Control of spread vectors. Although we now have a good control of
sparse vectors, they unfortunately comprise a small part of the unit ball $B_2^n$.
More common but harder to deal with are "spread vectors" - those having
many coordinates that are not close to zero. The next result gains control of
the spread vectors.

Proposition 4.7 (Spread vectors). Let $A$ and $B$ be matrices as in Lemma 4.4
with $N \ge n$. Let $M \ge 2$. Consider the set of vectors

$$B_{2,\infty} := \Big\{ x \in \mathbb{R}^n : \|x\|_2 \le 1,\ \|x\|_\infty \le \frac{M}{\sqrt{n}} \Big\}.$$

Then

$$\mathbb{E} \sup_{x \in B_{2,\infty}} \|BAx\|_2 \le C \log^{3/2}(M) \cdot K \sqrt{np + \log(2N)}.$$

Proof. This time we will need to work with multiple nets to account for different
possible distributions of the magnitude of the coordinates of vectors $x \in B_{2,\infty}$.
Since $\|x\|_\infty \le \|x\|_2$, without loss of generality we can assume that $M \le \sqrt{n}$.

Step 1: construction of nets. Let

$$h_k := \frac{2^k}{\sqrt{n}}, \qquad k = -2, -1, 0, 1, 2, \ldots, \log_2 M,$$

and let

$$\mathcal{N} := \{ x \in B_{2,\infty} : \forall j\ \exists k \text{ such that } |x_j| = h_k \}.$$

A standard calculation shows that $\mathcal{N}$ is a $\frac{1}{2}$-net of $B_{2,\infty}$ in the $B_{2,\infty}$-norm,
i.e. for every $x \in B_{2,\infty}$ there exists $y \in \mathcal{N}$ such that $x - y \in \frac{1}{2} B_{2,\infty}$. Therefore,
by Proposition 2.6,

$$\sup_{x \in B_{2,\infty}} \|BAx\|_2 \le 2 \sup_{x \in \mathcal{N}} \|BAx\|_2.$$

Fix $x \in \mathcal{N}$. Since $\|x\|_2 \le 1$, the number of coordinates of $x$ that satisfy
$|x_j| = h_k$ is at most $\lfloor h_k^{-2} \rfloor$, for every $k$. Decomposing $x$ according to the
coordinates whose absolute value is $h_k$, we have by the triangle inequality that

$$\sup_{x \in B_{2,\infty}} \|BAx\|_2 \le 2 \sum_{k=-2}^{\log_2 M} \sup_{x \in \mathcal{N}_k} \|BAx\|_2, \tag{4.7}$$

where

$$\mathcal{N}_k = \{ x \in B_2^n : \|x\|_0 \le h_k^{-2},\ \text{all nonzero coordinates of } x \text{ satisfy } |x_j| = h_k \}.$$

Fix $k$ and assume that $\mathcal{N}_k \ne \emptyset$. Since $h_k \le M/\sqrt{n}$, we have

$$m := \lfloor h_k^{-2} \rfloor \ge \lfloor n/M^2 \rfloor \ge 1. \tag{4.8}$$

To estimate the cardinality of $\mathcal{N}_k$, note that there are at most $\min(m, n)$ ways
to choose $\|x\|_0 =: l$; there are $\binom{n}{l}$ ways to choose the support of $x$; and there are
$2^l$ ways to choose the (signs of) nonzero coordinates of $x$. Hence by Stirling's
approximation and using (4.8), we have

$$|\mathcal{N}_k| \le \sum_{l=1}^{\min(m,n)} \binom{n}{l} 2^l \le \min\Big\{ \Big( \frac{2en}{m} \Big)^m, 4^n \Big\} \le (4eM^2)^m \le \exp(Cm \log M) \tag{4.9}$$

where $C \ge 1$ is an absolute constant.

Step 2: control of a fixed vector. Fix $k$ and fix $x \in \mathcal{N}_k$. As we saw in the
proof of Lemma 4.5,

$$\|BAx\|_2 \text{ is distributed identically with } \Big\| \sum_{i=1}^N g_i \lambda_i B_i \Big\|_2$$

where

$$\lambda_i = \Big( \sum_{j=1}^n a_{ij}^2 x_j^2 \Big)^{1/2}$$

and where $g_i$ are independent standard normal random variables. Since
$x \in \mathcal{N}_k$, we have $\|x\|_\infty = h_k \le \frac{1}{\sqrt{m}}$. This and the second condition in (4.4) yield

$$\lambda_i \le \frac{1}{\sqrt{m}} \Big( \sum_{j=1}^n a_{ij}^2 \Big)^{1/2} \le K \sqrt{\frac{np + \log(2N)}{m}}.$$

We consider the map $f : \mathbb{R}^N \to \mathbb{R}$ given by

$$f(y) = \Big\| \sum_{i=1}^N y_i \lambda_i B_i \Big\|_2.$$

Repeating the estimate in the proof of Lemma 4.5, we bound the Lipschitz
norm as

$$\|f\|_{\mathrm{Lip}} \le \max_i \lambda_i \le K \sqrt{\frac{np + \log(2N)}{m}}.$$

Then the Gaussian concentration, Theorem 2.1, gives for every $t > 0$:

$$\mathbb{P}\big( f(g) - \mathbb{E} f(g) > t \big) \le \exp\Big( -\frac{c_0 t^2 m}{K^2 (np + \log(2N))} \Big),$$

where $g = (g_1, \ldots, g_N)$. Since, as we noted above, $f(g)$ is distributed identically
with $\|BAx\|_2$, Lemma 4.4 yields that

$$\mathbb{P}\big( \|BAx\|_2 > K \sqrt{np + \log(2n)} + t \big) \le \exp\Big( -\frac{c_0 t^2 m}{K^2 (np + \log(2N))} \Big).$$

Let $u > 0$ be arbitrary. Applying the above estimate for $t = uK \sqrt{np + \log(2N)}$
and using $N \ge n$, we conclude that

$$\mathbb{P}\big( \|BAx\|_2 > (1 + u) K \sqrt{np + \log(2N)} \big) \le \exp(-c_0 u^2 m). \tag{4.10}$$

Step 3: union bound. Taking the union bound in (4.10) over all $x \in \mathcal{N}_k$ and
using estimate (4.9) on the cardinality of $\mathcal{N}_k$, we have for all $u > 0$:

$$\mathbb{P}\Big( \sup_{x \in \mathcal{N}_k} \|BAx\|_2 > (1 + u) K \sqrt{np + \log(2N)} \Big) \le |\mathcal{N}_k| \exp(-c_0 u^2 m) \le \exp(Cm \log M - c_0 u^2 m).$$

Let $s \ge 1$. We choose $u = C_1 s \sqrt{\log M}$, where $C_1 := \sqrt{C/c_0}$. Since $u \ge 1$ and
$m \ge 1$, $M \ge 2$, we obtain from the above estimate that

$$\mathbb{P}\Big( \sup_{x \in \mathcal{N}_k} \|BAx\|_2 > 2C_1 s K \sqrt{\log(M)(np + \log(2N))} \Big) \le \exp\big( C(1 - s^2) m \log M \big) \le \exp\big( c(1 - s^2) \big).$$

Integrating yields that

$$\mathbb{E} \sup_{x \in \mathcal{N}_k} \|BAx\|_2 \le C_2 K \sqrt{\log(M)(np + \log(2N))}.$$

Putting this back in (4.7), we conclude that

$$\mathbb{E} \sup_{x \in B_{2,\infty}} \|BAx\|_2 \le 2(3 + \log M) \cdot C_2 K \sqrt{\log(M)(np + \log(2N))}.$$

This completes the proof. □

4.5. Norms of sparse matrices, and proof of Theorem 4.1. Propositions
4.6 and 4.7 together handle all vectors in the unit ball, and yield the
following norm estimate:

Proposition 4.8. Let $A$ and $B$ be matrices as in Lemma 4.4 with $N \ge n$.
Then

$$\mathbb{E}\|BA\| \le C \log^{3/2}\Big( \frac{e}{p} \Big) \cdot K \sqrt{np + \log(2N)}.$$

Proof. Let $c$ be the absolute constant as in Proposition 4.6; we can clearly
assume that $c \le 1/4$. We define

$$M = \sqrt{\frac{1}{cp} \log \frac{e}{p}}.$$

Note that $M \ge 2$ as required in Proposition 4.7.

Fix a vector $x \in B_2^n$. We decompose it according to the magnitude of the
coordinates, as follows:

$$x = y + z, \qquad y := x \mathbf{1}_{\{j : |x_j| > M/\sqrt{n}\}}, \qquad z := x \mathbf{1}_{\{j : |x_j| \le M/\sqrt{n}\}}.$$

Clearly, $\|y\|_2 \le \|x\|_2 \le 1$, $\|z\|_2 \le \|x\|_2 \le 1$. By Markov's inequality, we have

$$\|y\|_0 = \big| \{ j : |x_j| > M/\sqrt{n} \} \big| \le \frac{n}{M^2} = \frac{cnp}{\log(e/p)}.$$

Then $y \in B_{2,0}$ as in Proposition 4.6. On the other hand, $\|z\|_\infty \le M/\sqrt{n}$ by
definition, so $z \in B_{2,\infty}$ as in Proposition 4.7. Therefore, by Propositions 4.6
and 4.7, we have

$$\mathbb{E}\|BA\| = \mathbb{E} \sup_{x \in B_2^n} \|BAx\|_2 \le \mathbb{E} \sup_{y \in B_{2,0}} \|BAy\|_2 + \mathbb{E} \sup_{z \in B_{2,\infty}} \|BAz\|_2$$

$$\le 3K \sqrt{np + \log(2n)} + C \log^{3/2}(M) \cdot K \sqrt{np + \log(2N)}.$$

Our choice of $M$ and the assumption $N \ge n$ complete the proof. □
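The decomposition $x = y + z$ used in the proof of Proposition 4.8 is simple to carry out explicitly. The sketch below splits a vector at the threshold $M/\sqrt{n}$ and checks the two properties the proof relies on: the large part $y$ has at most $n/M^2$ nonzero coordinates, and the small part $z$ has sup-norm at most $M/\sqrt{n}$. The function name and the sample vector are illustrative assumptions.

```python
import math

def split_sparse_spread(x, M):
    """Split x into y (coordinates with |x_j| > M/sqrt(n)) and z (the remaining ones)."""
    n = len(x)
    threshold = M / math.sqrt(n)
    y = [c if abs(c) > threshold else 0.0 for c in x]
    z = [c if abs(c) <= threshold else 0.0 for c in x]
    return y, z, threshold

# Hypothetical vector in the unit Euclidean ball (its squared norm is 0.99).
x = [0.8, 0.5, 0.2, 0.2, 0.1, 0.1, 0.0, 0.0]
M = 2.0
y, z, threshold = split_sparse_spread(x, M)

support_y = sum(1 for c in y if c != 0.0)
print("nonzero coordinates of y:", support_y, " allowed:", len(x) / M ** 2)
print("sup-norm of z:", max(abs(c) for c in z), " allowed:", threshold)
```

The sparsity bound on $y$ is exactly the Markov-type count: each large coordinate contributes more than $M^2/n$ to $\|x\|_2^2 \le 1$.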

Finally, a standard symmetrization argument yields the following norm
estimate, which we shall use for sparse random matrices.

Corollary 4.9. Let $p \in (0, 1]$ and let $N \ge n$ be positive integers. Consider an
$N \times n$ random matrix $A$ whose entries are independent random variables $a_{ij}$
with mean zero and such that

$$\mathbb{E}|a_{ij}|^2 \le p, \qquad |a_{ij}| \le 1 \qquad \text{for every } i, j.$$

Let $B$ be an $n \times N$ matrix such that $\|B\| \le 1$. Then

$$\mathbb{E}\|BA\| \le C \log^{3/2}\Big( \frac{e}{p} \Big) \sqrt{np + \log(2N)}.$$

Remark. It would be interesting to remove the logarithmic term from this
estimate.

Proof. Let $g_{ij}$ be independent standard normal random variables. Consider
the random matrix $\tilde{A} = (g_{ij} a_{ij})$. By (2.1), we have

$$\mathbb{E}\|BA\| \le (2\pi)^{1/2}\, \mathbb{E}\|B\tilde{A}\|. \tag{4.11}$$

By Lemma 4.3, conditions (4.4) hold with some random parameter $K \ge 1$
which only depends on the random variables $(a_{ij})$ and not on $(g_{ij})$, and which
satisfies

$$\mathbb{E}_a K \le C_1 \tag{4.12}$$

where $C_1$ is an absolute constant. Here and below we write $\mathbb{E}_a$ when the
expectation is with respect to $(a_{ij})$, and $\mathbb{E}_g$ if the expectation is with respect
to $(g_{ij})$.

Condition on the random variables $(a_{ij})$. Proposition 4.8 then yields

$$\mathbb{E}_g \|B\tilde{A}\| \le C \log^{3/2}\Big( \frac{e}{p} \Big) \cdot K \sqrt{np + \log(2N)}.$$

Therefore, when we remove the conditioning, we obtain by (4.12) that

$$\mathbb{E}\|B\tilde{A}\| = \mathbb{E}_a \mathbb{E}_g \|B\tilde{A}\| \le C \log^{3/2}\Big( \frac{e}{p} \Big) \cdot C_1 \sqrt{np + \log(2N)}.$$

This and (4.11) complete the proof. □



Proof of Theorem 4.1. By the standard symmetrization technique described in
Section 2, we can assume without loss of generality that all $a_{ij}$ are symmetric
random variables. We decompose the matrix $A$ according to the magnitude of
its entries as follows. Given a subset $I \subseteq \mathbb{R}$, we define the truncated matrix

$$\operatorname{trunc}(A, I) = \big( a_{ij} \mathbf{1}_{\{|a_{ij}| \in I\}} \big).$$

Consider

$$A^{(0)} = \operatorname{trunc}(A, [0, 1]); \qquad A^{(k)} = 2^{-k} \operatorname{trunc}(A, (2^{k-1}, 2^k]), \quad k = 1, 2, \ldots$$

Then we have a decomposition $A = \sum_{k=0}^\infty 2^k A^{(k)}$. This sum is actually finite
because of the boundedness assumption on $a_{ij}$. Indeed, we have

$$A = A^{(0)} + \sum_{k=1}^{k_0} 2^k A^{(k)} \tag{4.13}$$

where $k_0$ is the maximal integer such that

$$2^{k_0 - 1} \le \Big( \frac{Mn}{\log(2N)} \Big)^{\frac{1}{2+\varepsilon}}. \tag{4.14}$$

Because $a_{ij}$ are symmetric random variables, all entries $a_{ij}^{(k)}$ of the matrices
$A^{(k)}$ satisfy $\mathbb{E} a_{ij}^{(k)} = 0$ and $|a_{ij}^{(k)}| \le 1$.
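The dyadic decomposition of $A$ by magnitude of entries can be sketched in a few lines. The code below assumes the magnitude ranges are $[0, 1]$ for the level $0$ and the half-open intervals $(2^{k-1}, 2^k]$ for $k \ge 1$, so that every entry falls into exactly one level; the function names and the sample matrix are illustrative.

```python
def dyadic_pieces(A, k_max):
    """Return [A0, A1, ..., A_kmax], where A_k keeps the entries with magnitude in the
    k-th dyadic range, rescaled by 2**(-k) so that all kept entries lie in [-1, 1]."""
    pieces = []
    for k in range(k_max + 1):
        piece = []
        for row in A:
            new_row = []
            for v in row:
                if k == 0:
                    keep = abs(v) <= 1.0
                else:
                    keep = 2.0 ** (k - 1) < abs(v) <= 2.0 ** k
                new_row.append(2.0 ** (-k) * v if keep else 0.0)
            piece.append(new_row)
        pieces.append(piece)
    return pieces

def reassemble(pieces):
    """Compute sum_k 2**k * A_k entry by entry."""
    rows, cols = len(pieces[0]), len(pieces[0][0])
    return [[sum(2.0 ** k * P[i][j] for k, P in enumerate(pieces)) for j in range(cols)]
            for i in range(rows)]

# Hypothetical matrix with entries bounded by 8 = 2**3, so k_max = 3 suffices.
A = [[0.5, -3.0], [1.5, 8.0]]
pieces = dyadic_pieces(A, 3)
print(reassemble(pieces) == A)
```

Each level is a matrix with entries bounded by 1 that becomes sparser as $k$ grows, which is the situation that the estimate for sparse matrices is designed for.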

Using Corollary 4.9 for the matrix $A^{(0)}$ and $p = 1$, we obtain

$$\mathbb{E}\|BA^{(0)}\| \le C_1 \sqrt{n + \log(2N)} \le 2C_1 \sqrt{Mn}, \tag{4.15}$$

where the last inequality follows because $\log(2N) \le Mn$ and $M \ge 1$ by the
hypothesis.

Now we fix $1 \le k \le k_0$. Using the $(2 + \varepsilon)$-th moment assumption, we have
by Markov's inequality that

$$\mathbb{P}(a_{ij}^{(k)} \ne 0) \le \mathbb{P}(|a_{ij}| > 2^{k-1}) \le 2^{-(2+\varepsilon)(k-1)} =: p_k.$$

This and the bound $|a_{ij}^{(k)}| \le 1$ yield $\mathbb{E}(a_{ij}^{(k)})^2 \le p_k$. With this, we apply
Corollary 4.9 for the matrix $A^{(k)}$ and obtain

$$\mathbb{E}\|BA^{(k)}\| \le C \log^{3/2}\Big( \frac{e}{p_k} \Big) \sqrt{np_k + \log(2N)}.$$

By the definition of $p_k$ and by (4.14), we have

$$p_k \ge p_{k_0} \ge \frac{\log(2N)}{Mn}.$$

Therefore, $np_k + \log(2N) \le (1 + M) np_k \le 2Mnp_k$, so

$$\mathbb{E}\|BA^{(k)}\| \le C \log^{3/2}\Big( \frac{e}{p_k} \Big) \sqrt{2Mnp_k} \le C_2 [1 + (2 + \varepsilon)(k - 1)]^{3/2}\, 2^{-(1 + \varepsilon/2)(k-1)} \cdot \sqrt{Mn}. \tag{4.16}$$

Using (4.13) and the triangle inequality, then using (4.15) and (4.16), we
conclude that

$$\mathbb{E}\|BA\| \le \mathbb{E}\|BA^{(0)}\| + \sum_{k=1}^{k_0} 2^k\, \mathbb{E}\|BA^{(k)}\|$$

$$\le 2C_1 \sqrt{Mn} + \sum_{k=1}^{k_0} C_2 [1 + (2 + \varepsilon)(k - 1)]^{3/2}\, 2^{k - (1 + \varepsilon/2)(k-1)} \cdot \sqrt{Mn}$$

$$\le C_3 \sqrt{Mn} \cdot \sum_{k=1}^{\infty} k^{3/2}\, 2^{-\varepsilon k/2} = C(\varepsilon) \sqrt{Mn}.$$

This completes the proof of Theorem 4.1. □

4.6. Almost square matrices. The main application of Theorem 4.1 is for
almost square matrices - those for which $N = n^{1 + O(\varepsilon)}$. The next lemma verifies
the hypotheses of Theorem 4.1 for such matrices.

Lemma 4.10. Let $\varepsilon \in (0, 1)$ and let $N, n$ be positive integers satisfying
$N \le n^{1 + \varepsilon/10}$. Let $A$ be an $N \times n$ random matrix whose entries are independent
random variables with $(4 + \varepsilon)$-th moment bounded by 1. Define the random
variable $M$ by the equation

$$\max_{i,j} |a_{ij}| = \Big( \frac{Mn}{\log(2N)} \Big)^{\frac{1}{2 + \varepsilon/4}}. \tag{4.17}$$

Then, for every $t \ge 1$, one has

$$\mathbb{P}(M > C(\varepsilon) t) \le \frac{1}{t^2}.$$

In particular, one has $\mathbb{E} M \le C_1(\varepsilon)$.

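Proof. By Markov's inequality, we have for every $i, j$ that

$$\mathbb{P}(|a_{ij}| > a) \le \frac{1}{a^{4+\varepsilon}}, \qquad a > 0.$$

Let $t \ge 1$. We then have

$$\mathbb{P}\big( |a_{ij}| > (t^2 nN)^{\frac{1}{4+\varepsilon}} \big) \le \frac{1}{t^2 nN}.$$

Taking the union bound over all $nN$ random variables $a_{ij}$, we obtain

$$\mathbb{P}\big( \max_{i,j} |a_{ij}| > (t^2 nN)^{\frac{1}{4+\varepsilon}} \big) \le \frac{1}{t^2}. \tag{4.18}$$

The assumption $N \le n^{1 + \varepsilon/10}$ yields that

$$nN \le \Big( \frac{C(\varepsilon) n}{\log(2N)} \Big)^{2 + \varepsilon/8}.$$

Therefore, since $\frac{2 + \varepsilon/8}{4 + \varepsilon} \le \frac{1}{2 + \varepsilon/4}$ and $t \ge 1$, we have

$$(t^2 nN)^{\frac{1}{4+\varepsilon}} \le \Big( \frac{C(\varepsilon) t n}{\log(2N)} \Big)^{\frac{1}{2 + \varepsilon/4}}.$$

Using this in (4.18), we obtain

$$\mathbb{P}(M > C(\varepsilon) t) \le \mathbb{P}\big( \max_{i,j} |a_{ij}| > (t^2 nN)^{\frac{1}{4+\varepsilon}} \big) \le \frac{1}{t^2}.$$

Integration completes the proof. □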
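The proof of Lemma 4.10 hinges on comparing two exponents. Taking that comparison to be $\frac{2 + \varepsilon/8}{4 + \varepsilon} \le \frac{1}{2 + \varepsilon/4}$ (our reading of the step, stated here as an assumption), it can be verified in exact rational arithmetic on a grid of values of $\varepsilon$; the function name `exponent_gap` is ours.

```python
from fractions import Fraction

def exponent_gap(eps):
    """1/(2 + eps/4) - (2 + eps/8)/(4 + eps); a nonnegative value means the comparison holds."""
    eps = Fraction(eps)
    return 1 / (2 + eps / 4) - (2 + eps / 8) / (4 + eps)

# Grid of values eps = 0.01, 0.02, ..., 0.99 inside the interval (0, 1).
gaps = [exponent_gap(Fraction(k, 100)) for k in range(1, 100)]
print("comparison holds on the grid:", all(g >= 0 for g in gaps))
```

Cross-multiplying shows why: the comparison is equivalent to $(2 + \varepsilon/8)(2 + \varepsilon/4) \le 4 + \varepsilon$, that is, to $\varepsilon^2/32 \le \varepsilon/4$, which holds for all $0 < \varepsilon \le 8$.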

We are now ready to state and prove a partial case of Theorem 1.1 for almost
square matrices.

Corollary 4.11. Let $\varepsilon \in (0, 1)$ and let $N, n$ be positive integers satisfying
$N \le n^{1 + \varepsilon/10}$. Let $A$ be an $N \times n$ random matrix whose entries are independent
random variables with mean zero and $(4 + \varepsilon)$-th moment bounded by 1. Let $B$
be an $n \times N$ matrix such that $\|B\| \le 1$. Then

$$\mathbb{E}\|BA\| \le C(\varepsilon) \sqrt{n}.$$

Proof. Without loss of generality we may assume that $N \ge n$ by adding an
appropriate number of zero rows to $A$ and zero columns to $B$. Also, using the
standard symmetrization, we can assume that the random variables $a_{ij}$ are
symmetric. Let $M$ be the random variable as in Lemma 4.10, and let $t \ge 1$.
By the definition, $\{M \le t\}$ is the product event. Therefore, conditioning on
this event (i) preserves the independence of the entries of $A$; (ii) makes all
these entries bounded as in (4.17); (iii) can only reduce their moments by
Lemma 2.9, thus for all $i, j$ we have

$$\mathbb{E}\big( |a_{ij}|^{2 + \varepsilon/4} \,\big|\, M \le t \big) \le \mathbb{E}|a_{ij}|^{2 + \varepsilon/4} \le 1.$$

Therefore, we can apply Corollary 4.2 conditionally, with $\varepsilon/4$ and with $M$
replaced by $\max(M, 10)$, which gives

$$\big[ \mathbb{E}\big( \|BA\|^2 \,\big|\, M \le t \big) \big]^{1/2} \le C_0(\varepsilon) \sqrt{tn} \qquad \text{for } t \ge 1.$$

Additionally, by Lemma 4.10 we have

$$\mathbb{P}(M > C(\varepsilon) t) \le \frac{1}{t^2} \qquad \text{for } t \ge 1.$$

By Lemma 2.10, this yields

$$\mathbb{E}\|BA\| \le (\mathbb{E}\|BA\|^2)^{1/2} \le C_1(\varepsilon) \sqrt{n}$$

as claimed. □

5. Completion of the proof of Theorem 1.1

Proof of Theorem 1.1. By adding an appropriate number of zero rows to $B$ or
zero columns to $A$ we can assume that $m = n$, thus $B$ is an $n \times N$ matrix.
Consider the exponent

$$\kappa = \kappa(\varepsilon) = \frac{1}{2} + \frac{4}{\varepsilon}.$$

As usual, let $B_1, \ldots, B_N$ be the columns of the matrix $B$. Consider the subset
$I \subseteq \{1, \ldots, N\}$ of large columns defined as

$$I := \{ i : \|B_i\|_2 > C_0(\varepsilon) \log^{-\kappa}(2n) \}.$$

Here we choose $C_0(\varepsilon)$ sufficiently large so that, by (2.4) and Markov's
inequality, we have

$$N_0 := |I| \le C_0(\varepsilon)^{-2}\, n \log^{2\kappa}(2n) \le n^{1 + \varepsilon/10}.$$

Denote by $A_I$ the $N_0 \times n$ submatrix of $A$ whose rows are in $I$, by $B_I$ the
$n \times N_0$ submatrix of $B$ whose columns are in $I$ (and similarly for $I^c$). The
decomposition $BA = B_I A_I + B_{I^c} A_{I^c}$ implies by the triangle inequality that

$$\|BA\| \le \|B_I A_I\| + \|B_{I^c} A_{I^c}\|. \tag{5.1}$$

This splits our problem into two subproblems, one for $I$ and one for $I^c$. Of
course, if $I$ or $I^c$ is empty then the corresponding matrix is zero and we can
skip its estimation.

The matrices $A_I$, $B_I$ are almost square, so Corollary 4.11 applies to them,
giving

(5.2) $\mathbb{E}\|B_I A_I\| \le C(\varepsilon)\sqrt{n}$.

On the other hand, the columns of the matrix $B_{I^c}$ are small by the definition
of $I$:

$\|B_i\|_2 \le C_0(\varepsilon) \log^{-\kappa}(2n)$ for every $i \in I^c$.

Therefore, Theorem 3.9 applies to the matrices $A_{I^c}$, $B_{I^c}$, which gives

(5.3) $\mathbb{E}\|B_{I^c} A_{I^c}\| \le C_1(\varepsilon)\sqrt{n}$.

Putting estimates (5.2) and (5.3) into (5.1), we conclude that

$\mathbb{E}\|BA\| \le C_2(\varepsilon)\sqrt{n}$.

Theorem 1.1 is proved. □
SPECTRAL NORM OF PRODUCTS OF RANDOM AND
DETERMINISTIC MATRICES 



ROMAN VERSHYNIN 

Abstract. We study the spectral norm of matrices $W$ that can be factored
as $W = BA$, where $A$ is a random matrix with independent mean zero
entries and $B$ is a fixed matrix. Under the $(4 + \varepsilon)$-th moment assumption
on the entries of $A$, we show that the spectral norm of such an $m \times n$ matrix
$W$ is bounded by $\sqrt{m} + \sqrt{n}$, which is sharp. In other words, in regard to
the spectral norm, products of random and deterministic matrices behave
similarly to random matrices with independent entries. This result along
with the previous work of M. Rudelson and the author implies that the
smallest singular value of a random $m \times n$ matrix with i.i.d. mean zero
entries and bounded $(4 + \varepsilon)$-th moment is bounded below by $\sqrt{m} - \sqrt{n-1}$
with high probability.



1. Introduction 

This paper grew out of an attempt to understand the class of random matri- 
ces with non-independent entries, but which can be factorized through random 
matrices with independent entries. Equivalently, we are interested in sample 
covariance matrices of a wide class of random vectors - the linear transforma- 
tions of vectors with independent entries. 

Here we study the spectral norm of such matrices. Recall that the spectral
norm $\|W\|$ is defined as the largest singular value of a matrix $W$, which equals
the largest eigenvalue of $\sqrt{WW^*}$. Equivalently, the spectral norm can be
defined as the $\ell_2 \to \ell_2$ operator norm: $\|W\| = \sup_x \|Wx\|_2/\|x\|_2$ where $\|\cdot\|_2$
denotes the Euclidean norm. The spectral norm of random matrices plays a
notable role in particular in geometric functional analysis, computer science,
statistical physics, and signal processing.
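
For readers who want to experiment numerically, the operator-norm description suggests a simple way to approximate $\|W\|$: repeatedly apply $W^*W$ to a vector and renormalize. The sketch below is an illustration only, written with the Python standard library; the function name `spectral_norm` and its parameters are not from the paper, and power iteration is a standard numerical technique rather than anything used in the proofs.

```python
import math
import random

def spectral_norm(W, iters=200, seed=0):
    """Approximate the largest singular value of W (given as a list of rows)
    by power iteration on W^T W, started from a random Gaussian vector."""
    rng = random.Random(seed)
    m, n = len(W), len(W[0])
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm_Wx = 0.0
    for _ in range(iters):
        norm_x = math.sqrt(sum(v * v for v in x))
        if norm_x == 0.0:          # W maps everything to zero
            return 0.0
        x = [v / norm_x for v in x]
        y = [sum(W[i][j] * x[j] for j in range(n)) for i in range(m)]  # y = W x
        norm_Wx = math.sqrt(sum(v * v for v in y))                     # ||W x||_2 with ||x||_2 = 1
        x = [sum(W[i][j] * y[i] for i in range(m)) for j in range(n)]  # x = W^T y
    return norm_Wx

print(spectral_norm([[3.0, 0.0], [0.0, 4.0]]))
print(spectral_norm([[1.0, 2.0], [3.0, 4.0]]))
```

Each iterate is a value $\|Wx\|_2$ at a unit vector $x$, so the returned number never exceeds $\|W\|$ and increases towards it.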

1.1. Matrices with independent entries. For random matrices with independent
and identically distributed entries, the spectral norm is well studied.
Let $W$ be an $m \times n$ matrix whose entries are real independent and identically
distributed random variables with mean zero, variance 1 and finite fourth
moment. Estimates of the type

(1.1) $\|W\| \sim \sqrt{n} + \sqrt{m}$

are known to hold (and are sharp) in both the limit regime for dimensions
increasing to infinity, and the non-limit regime where the dimensions are fixed.
The meaning of (1.1) in the limit regime is that, for a family of matrices as
above whose dimensions $m$ and $n$ increase to infinity and whose aspect ratio
$m/n$ converges to a constant, the ratio $\|W\|/(\sqrt{n} + \sqrt{m})$ converges to 1 almost
surely [32].

In the non-limit regime, i.e. for arbitrary dimensions $n$ and $m$, variants of
(1.1) were proved by Y. Seginer [28] and R. Latala [17]. If $W$ is an $m \times n$
matrix whose entries are i.i.d. mean zero random variables, then denoting the
rows of $W$ by $X_i$ and the columns by $Y_j$, the result of Y. Seginer states
that

$\mathbb{E}\|W\| \le C\,\big(\mathbb{E}\max_i \|X_i\|_2 + \mathbb{E}\max_j \|Y_j\|_2\big)$

where $C$ is an absolute constant. This estimate is sharp because $\|W\|$ is
obviously bounded below by the Euclidean norm of any row and any column
of $W$. Furthermore, if the entries $w_{ij}$ of the matrix $W$ are not necessarily
identically distributed, then R. Latala's result [17] states that

$\mathbb{E}\|W\| \le C\,\big(\max_i \mathbb{E}\|X_i\|_2 + \max_j \mathbb{E}\|Y_j\|_2 + (\sum_{i,j} \mathbb{E} w_{ij}^4)^{1/4}\big)$.

In particular, if $W$ is an $m \times n$ matrix whose entries are independent random
variables with mean zero and fourth moments bounded by 1, then one can
deduce from either Y. Seginer's or R. Latala's result that

(1.2) $\mathbb{E}\|W\| \le C(\sqrt{n} + \sqrt{m})$.

This is a variant of (1.1) in the non-limit regime.
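
A quick numerical experiment can make the scale $\sqrt{n} + \sqrt{m}$ concrete. The sketch below draws one matrix with independent random signs (mean zero, fourth moment 1) and compares an approximation of its spectral norm with $\sqrt{n} + \sqrt{m}$. It is an illustration under stated assumptions, not part of the argument: the dimensions, the seed and the helper `spectral_norm` (power iteration) are all choices made here.

```python
import math
import random

def spectral_norm(W, iters=300, seed=0):
    """Approximate the largest singular value of W by power iteration on W^T W."""
    rng = random.Random(seed)
    m, n = len(W), len(W[0])
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm_Wx = 0.0
    for _ in range(iters):
        norm_x = math.sqrt(sum(v * v for v in x))
        if norm_x == 0.0:
            return 0.0
        x = [v / norm_x for v in x]
        y = [sum(W[i][j] * x[j] for j in range(n)) for i in range(m)]  # y = W x
        norm_Wx = math.sqrt(sum(v * v for v in y))
        x = [sum(W[i][j] * y[i] for i in range(m)) for j in range(n)]  # x = W^T y
    return norm_Wx

random.seed(0)
m, n = 40, 20
# Independent symmetric signs: mean zero, variance 1, fourth moment 1.
W = [[random.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(m)]
ratio = spectral_norm(W) / (math.sqrt(n) + math.sqrt(m))
print("||W|| / (sqrt(n) + sqrt(m)) =", ratio)
```

For sign matrices the Frobenius norm equals $\sqrt{mn}$ exactly, so the ratio is trapped between $\sqrt{m}/(\sqrt{n}+\sqrt{m})$ and $\sqrt{mn}/(\sqrt{n}+\sqrt{m})$ deterministically; the point of (1.2) is that it stays bounded by an absolute constant as the dimensions grow.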

The fourth moment hypothesis is known to be necessary. Consider again a
family of matrices whose dimensions $m$ and $n$ increase to infinity, and whose
aspect ratio $m/n$ converges to a constant. If the entries are independent and
identically distributed random variables with mean zero and infinite fourth
moment, then the upper limit of the ratio $\|W\|/(\sqrt{n} + \sqrt{m})$ is infinite almost
surely [32].

1.2. The main result. The main result of this paper is an extension of the
optimal bound (1.2) to the class of random matrices with non-independent
entries, but which can be factored through a matrix with independent entries.

Theorem 1.1. Let $\varepsilon \in (0, 1)$ and let $m, n, N$ be positive integers. Consider a
random $m \times n$ matrix $W = BA$, where $A$ is an $N \times n$ random matrix whose
entries are independent random variables with mean zero and $(4+\varepsilon)$-th moment
bounded by 1, and $B$ is an $m \times N$ non-random matrix such that $\|B\| \le 1$. Then

(1.3) $\mathbb{E}\|W\| \le C(\varepsilon)(\sqrt{n} + \sqrt{m})$

where $C(\varepsilon)$ is a function that depends only on $\varepsilon$.

Remarks. 1. An important feature of this result is that its conclusion is
independent of the dimension $N$.

2. The proof of Theorem 1.1 yields the stronger estimate

(1.4) $\mathbb{E}\|W\| \le C(\varepsilon)\big(\|B\|\sqrt{n} + \|B\|_{HS}\big)$

valid for arbitrary (non-random) $m \times N$ matrix $B$. This result is independent
of the dimensions of the matrix $B$, and therefore it holds for an arbitrary linear
operator $B$ acting from the $N$-dimensional Euclidean space $\ell_2^N$ to an arbitrary
Hilbert space.

3. Theorem 1.1 can be interpreted in terms of sample covariance matrices
of random vectors in $\mathbb{R}^m$ of the form $BX$, where $X$ is a random vector in
$\mathbb{R}^N$ with independent entries. Indeed, let $A$ be the random matrix whose
columns are $n$ independent samples of the vector $X$. Then $W = BA$ is the
matrix whose columns are $n$ independent samples of the random vector $BX$.
The sample covariance matrix of the random vector $BX$ is defined as
$\Sigma = \frac{1}{n}WW^*$. Theorem 1.1 states that the largest eigenvalue of $\Sigma$ is bounded by
$C_1(\varepsilon)(1 + m/n)$, which is further bounded by $C_2(\varepsilon)$ for the number of samples
$n \ge m$ (and independently of the dimension $N$). This problem was previously
studied in [I], [5] in the limit regime for $m = N$, where the result must of
course depend on $N$.

4. Under the stronger subgaussian moment assumption (1.6) on the entries,
Theorem 1.1 is easy to prove using standard concentration and an $\varepsilon$-net
argument. In contrast, if only some finite moment is assumed, we do not know any
simple proof.
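
Remark 1 can be explored numerically in a very special case. The sketch below takes $B$ to be a hypothetical block-averaging matrix with orthonormal rows (so $\|B\| = 1$), lets $N$ grow while $n = m$ stays fixed, and reports $\|BA\|/(2\sqrt{n})$. The choice of $B$, the sign entries of $A$, the sizes and the helper names are assumptions made for this illustration; for this particular $B$ the entries of $BA$ happen to remain independent, so the example shows only the absence of dependence on $N$, not the difficulty addressed by the theorem.

```python
import math
import random

def spectral_norm(W, iters=300, seed=0):
    """Approximate the largest singular value of W by power iteration on W^T W."""
    rng = random.Random(seed)
    m, n = len(W), len(W[0])
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm_Wx = 0.0
    for _ in range(iters):
        norm_x = math.sqrt(sum(v * v for v in x))
        if norm_x == 0.0:
            return 0.0
        x = [v / norm_x for v in x]
        y = [sum(W[i][j] * x[j] for j in range(n)) for i in range(m)]  # y = W x
        norm_Wx = math.sqrt(sum(v * v for v in y))
        x = [sum(W[i][j] * y[i] for i in range(m)) for j in range(n)]  # x = W^T y
    return norm_Wx

def block_average_product(A, n, k):
    """Return W = B A, where B is the n x (n*k) matrix whose i-th row equals
    1/sqrt(k) on the i-th block of k coordinates and 0 elsewhere.
    The rows of B are orthonormal, hence ||B|| = 1."""
    cols = len(A[0])
    return [[sum(A[i * k + r][c] for r in range(k)) / math.sqrt(k) for c in range(cols)]
            for i in range(n)]

random.seed(0)
n = 20
ratios = {}
for k in (1, 4, 16):
    N = n * k
    A = [[random.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(N)]  # N x n sign matrix
    W = block_average_product(A, n, k)                                       # n x n matrix BA
    ratios[N] = spectral_norm(W) / (2.0 * math.sqrt(n))
print(ratios)
```

In this experiment the ratios stay of order one as $N$ increases from 20 to 320, in line with the bound (1.3) for $m = n$.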

1.3. The smallest singular value. Our main motivation for Theorem 1.1
was to complete the analysis of the smallest singular value of random
rectangular matrices carried out by M. Rudelson and the author in [27]. The
smallest singular value $s_{\min}(W)$ of a matrix $W$ can be equivalently described
as $s_{\min}(W) = \inf_x \|Wx\|_2/\|x\|_2$.

Analyzing the smallest singular value is generally harder than analyzing the
largest one (the spectral norm). The analogue of (1.1) for the smallest singular
value of random $m \times n$ matrices $W$ (for $m \ge n$) is

(1.5) $s_{\min}(W) \sim \sqrt{m} - \sqrt{n}$.

The optimal limit version of this result proved in [7] holds under exactly the
same hypotheses as (1.1) - for i.i.d. entries with mean zero, variance 1 and
finite fourth moment.

Many papers addressed (1.5) for fixed dimensions $n$, $m$. Sufficiently tall
matrices ($m \ge Cn$ for sufficiently large $C$) were studied in [8]; extensions to
genuinely rectangular matrices ($m \ge (1 + \varepsilon)n$ for some $\varepsilon > 0$) were studied in
[20], [2], [23], with gradually improving dependence on $\varepsilon$. An optimal version of
(1.5) for all dimensions was obtained in [27]. All these works put somewhat
stronger moment assumptions than the fourth moment on the entries of the
matrix $W$. A convenient assumption is that the entries $w_{ij}$ are subgaussian
random variables. This means that all their moments are bounded by the
corresponding moments of the standard normal random variable, i.e.

(1.6) $(\mathbb{E}|w_{ij}|^p)^{1/p} \le M\sqrt{p}$ for all $p \ge 1$

where $M$ is called the subgaussian moment. It was proved in [27] that if
the entries of $W$ are i.i.d. mean zero subgaussian random variables with unit
variance, then for every $t > 0$ one has

(1.7) $\mathbb{P}\big(s_{\min}(W) \le t(\sqrt{m} - \sqrt{n-1})\big) \le (Ct)^{m-n+1} + e^{-cm}$

where $C, c > 0$ depend only on the subgaussian moment $M$. In particular, for
such matrices we have

(1.8) $s_{\min}(W) \ge c_1(\sqrt{m} - \sqrt{n-1})$ with high probability

where $c_1 > 0$ depends only on the desired probability and the subgaussian
moment. This result encompasses the case of square matrices where $m = n$
and hence (1.8) yields $s_{\min}(W) \ge c_2/\sqrt{n}$. For Gaussian square matrices this
optimal bound was obtained in [11] and [29]; for general square matrices a
weaker bound $n^{-3/2}$ was obtained in [21] and the best bound as above in [25];
the estimate is shown to be optimal in [26].
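
As a sanity check on definition (1.6), the absolute moments of a standard normal variable $g$ are available in closed form, $\mathbb{E}|g|^p = 2^{p/2}\,\Gamma((p+1)/2)/\sqrt{\pi}$, so the ratio $(\mathbb{E}|g|^p)^{1/p}/\sqrt{p}$ can be tabulated directly. The short sketch below does this with the Python standard library; the function name and the range of $p$ are choices made for the illustration.

```python
import math

def gaussian_abs_moment(p):
    """E|g|^p for a standard normal g, via 2^(p/2) * Gamma((p+1)/2) / sqrt(pi)."""
    return 2.0 ** (p / 2.0) * math.gamma((p + 1.0) / 2.0) / math.sqrt(math.pi)

# ratios[p-1] = (E|g|^p)^(1/p) / sqrt(p) for p = 1, ..., 60
ratios = [gaussian_abs_moment(p) ** (1.0 / p) / math.sqrt(p) for p in range(1, 61)]
print("largest ratio over p = 1..60:", max(ratios))
```

All tabulated ratios are below 1, so over this range the standard normal variable satisfies (1.6) with $M = 1$.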

Whether (1.8) holds under weaker moment assumptions was only known in
the case of square matrices. It was proved in [25] using (1.2) that (1.8) holds
under the fourth moment assumption for square matrices, i.e. for $m = n$.
Whether the same is true for arbitrary rectangular matrices under the fourth
moment assumption was left open in [27]. The bottleneck of the argument
occurred in Proposition 7.3 of [27], where we needed a correct bound on the
spectral norm of a product of a random matrix and a fixed orthogonal
projection. Such a bound was easy to get only under the subgaussian hypothesis.
Theorem 1.1 of the present paper extends the argument of [27] to random
matrices with bounded $(4 + \varepsilon)$-th moment. The following corollary follows
directly from the argument of [27] and Theorem 1.1.

Corollary 1.2 (Smallest singular value). Let $\varepsilon \in (0, 1)$ and $m \ge n$ be positive
integers. Let $A$ be a random $m \times n$ matrix whose entries are i.i.d. random
variables with mean zero, unit variance and $(4 + \varepsilon)$-th moment bounded by $M$.
Then, for every $\delta > 0$ there exist $t > 0$ and $n_0$ which depend only on $\varepsilon$, $\delta$ and
$M$, and such that

$\mathbb{P}\big(s_{\min}(A) \le t(\sqrt{m} - \sqrt{n-1})\big) \le \delta$ for all $n \ge n_0$.

This result follows by the argument in [27], where one considers probability
estimates conditional on the event that the norm of a product $W$ of a random
matrix and a non-random orthogonal projection is small (see [27, Proposition
7.3]).

After this paper was written, two important related results appeared on the
universality of the smallest singular value in two extreme regimes - for almost
square matrices and for genuinely rectangular matrices. One of these results,
by T. Tao and V. Vu [31], works for square and almost square matrices where
the defect $m - n$ is constant. It is valid for matrices with i.i.d. entries with
mean zero, unit variance and bounded $C$-th moment where $C$ is a sufficiently
large absolute constant. The result states that the smallest singular value of
such $m \times n$ matrices $A$ is asymptotically the same as that of the Gaussian matrix $G$
of the same dimensions and with i.i.d. standard normal entries. Specifically,

(1.9) $\mathbb{P}(m\, s_{\min}(G)^2 \le t - m^{-c}) - m^{-c} \le \mathbb{P}(m\, s_{\min}(A)^2 \le t)$
      $\le \mathbb{P}(m\, s_{\min}(G)^2 \le t + m^{-c}) + m^{-c}$.

This universality result, combined with the known asymptotic estimates of
the smallest singular value of Gaussian matrices $s_{\min}(G)$, allows one to obtain
bounds sharper than in Corollary 1.2. However, the universality result of [31]
is only known in the almost square regime $m - n = O(1)$ (and under stronger
moment assumptions), while Corollary 1.2 is valid for all dimensions $m \ge n$.

Another recent universality result was obtained by O. Feldheim and S. Sodin
[12] for genuinely rectangular matrices, i.e. with aspect ratio $m/n$ separated
from 1 by a constant, and with subgaussian i.i.d. entries. In particular they
proved the inequality

(1.10) $\mathbb{P}\big(s_{\min}(A) \le (\sqrt{m} - \sqrt{n})^2 - tm\big) \le \frac{C}{1 - \sqrt{m/n}} \exp(-c n t^{3/2})$.

Deviation inequalities (1.7) and (1.10) complement each other - the former
is multiplicative (and is valid for arbitrary dimensions) while the latter is
additive (and is applicable for genuinely rectangular matrices). Each of these
two inequalities clearly has the regime where it is stronger.

1.4. Outline of the argument. Let us sketch the proof of Theorem 1.1. We
can assume that $m = n$ by adding an appropriate number of zero columns
to $A$ or rows to $B$. Since the columns of $A$ are independent, the columns
$X_1, \dots, X_n$ of the matrix $W$ are independent random vectors in $\mathbb{R}^n$. We would
like to bound the spectral norm of $WW^* = \sum_j X_j \otimes X_j$, which is a sum of
independent random operators. For random vectors $X_j$ uniformly distributed
in convex bodies, deviation inequalities for sums $\sum_j X_j \otimes X_j$ were studied in
[15], [10], [22], [TH], [21], [3], [1]. For general distributions, a sharp estimate for such
sums has been proved by M. Rudelson [22]. This approach, which we develop
in Section 3, leads us to the bound

(1.11) $\mathbb{E}\|W\| \le C\sqrt{n \log n}$.

This bound is already independent of the dimension $N$, but is off by $\sqrt{\log n}$
from being optimal. The logarithmic term is unfortunately a limitation of
this method. This term comes from M. Rudelson's result, Theorem 3.1 below,
where it is needed in full generality. It would be useful to understand the
situations where the logarithmic term can be removed from M. Rudelson's
theorem. So far, only one such situation is known from [1] where the independent
random vectors $X_j$ are uniformly distributed in a convex body.

In the absence of a suitable variant of M. Rudelson's theorem without the
logarithmic term, the rest of our argument will proceed to remove this term from
(1.11) using the rich independence structure, which is inherited by the vectors
$X_j$ from the random matrix $A$. (However, the independence structure is
encoded nontrivially via the linear transformation $B$, which makes the entries of
$X_j$ dependent.) A more delicate application of M. Rudelson's theorem allows
one to transfer the logarithmic term from the conclusion to the assumption.
Namely, Theorem 3.9 establishes the optimal bound $\mathbb{E}\|W\| \le C\sqrt{n}$ in the case
when all columns of $B$ are logarithmically small, i.e. their Euclidean norm is
at most $\log^{-O(1)} n$. While some columns of a general matrix $B$ may be large,
the boundedness of $B$ implies that most columns are always logarithmically
small - all but $n \log^{O(1)} n$ of them. So, we can remove from $B$ the
already controlled small columns, which will make $B$ an almost square matrix.
In other words, we can assume hereafter that $N = n \log^{O(1)} n$.

The advantage of almost square matrices is that the magnitude of their
entries is easy to control. A simple consequence of the $(4 + \varepsilon)$-th moment
hypothesis and Markov's inequality yields that the entries of $A = (a_{ij})$ satisfy
$\max_{i,j} |a_{ij}| \le \sqrt{n}$ with high probability. Note that the same estimate holds for
square matrices ($N = n$) under the fourth moment assumption. So, in regard
to the magnitude of entries, almost square matrices are similar to exactly
square matrices, for which the desired bound follows from R. Latala's result.
This prompts us to construct the proof of Theorem 1.1 for almost square
matrices similarly to R. Latala's argument in [17], i.e. using fairly standard
concentration of measure results in the Gauss space, coupled with delicate
constructions of nets. We first decompose $A$ into a sum of matrices which
contain entries of similar magnitude. As the magnitude increases, these matrices
become sparser. This quickly reduces the problem to random sparse matrices,
whose entries are i.i.d. random variables valued in $\{-1, 0, 1\}$. The spectral
norm of random sparse matrices was studied in [16] as a development of the
work of Z. Füredi and J. Komlós [13]. However, we need to bound the
spectral norm of the matrix $W = BA$ rather than $A$. Independence of entries is
not available for $W$, which makes it difficult to use the known combinatorial
methods based on bounding the trace of high powers of $W$.

To summarize, at this point we have an almost square random sparse matrix
$A$, and we need to bound the spectral norm of $W = BA$, which is
$\|W\| = \sup_x \|Wx\|_2$, where the supremum is over all unit vectors $x \in \mathbb{R}^n$. The well
known method is to first fix $x$ and bound $\|Wx\|_2$ with high probability; then
take a union bound over all $x$ in a sufficiently fine net of the unit sphere of
$\mathbb{R}^n$. However, a probability bound for every fixed vector $x$, which follows from
standard concentration inequalities, is not strong enough to make this method
work. Sparse vectors - those which have few but large nonzero coordinates
- produce worse concentration bounds than spread vectors, which have many
but small nonzero coordinates. What helps us is that there are fewer sparse
vectors on the sphere than there are spread vectors. This leads to a tradeoff
between concentration and entropy, i.e. between the probability with which
$\|Wx\|_2$ is nicely bounded, and the size of a net for the vectors $x$ which achieve
this probability bound. One then divides the unit Euclidean sphere in $\mathbb{R}^n$
into classes of vectors according to their "sparsity", and uses the
entropy-concentration tradeoff for each class separately. This general line is already
present in Latala's argument [17], and it was developed extensively in
recent years, see e.g. [SUl ESI EI]. This argument is presented in Section 4,
where it leads to a useful estimate for norms of sparse matrices, Corollary 4.9.
With this in hand, one can quickly finish the proof of Theorem 1.1.

Acknowledgement. The author is grateful to the referee for careful reading
of the manuscript, and for many suggestions which greatly improved the
presentation.

2. Preliminaries 

2.1. Notation. Throughout the paper, the results are stated and proved over 
the field of real numbers. They are easy to generalize to complex numbers. 

We denote by $C, C_1, c, c_1, \dots$ positive absolute constants, and by $C(\varepsilon), C_1(\varepsilon), \dots$
positive quantities that may depend only on the parameter $\varepsilon$. Their values can
change from line to line.

The standard inner product in $\mathbb{R}^n$ is denoted $\langle x, y\rangle$. For a vector $x \in \mathbb{R}^n$, we
denote the cardinality of its support by $\|x\|_0 = |\{j : x_j \ne 0\}|$, the Euclidean
norm by $\|x\|_2 = (\sum_j x_j^2)^{1/2}$, and the sup-norm by $\|x\|_\infty = \max_j |x_j|$. The
unit Euclidean ball in $\mathbb{R}^n$ is denoted by $B_2^n = \{x : \|x\|_2 \le 1\}$, and the unit
Euclidean sphere in $\mathbb{R}^n$ is denoted by $S^{n-1} = \{x : \|x\|_2 = 1\}$.

The tensor product of vectors $x, y \in \mathbb{R}^n$ is the linear operator $x \otimes y$ on $\mathbb{R}^n$
defined as $(x \otimes y)(z) = \langle x, z\rangle y$ for $z \in \mathbb{R}^n$.

2.2. Concentration of measure. The method that we carry out in Section 4
uses concentration in the Gauss space in combination with constructions of
$\varepsilon$-nets. Here we recall some basic facts we need.

The standard Gaussian random vector $g \in \mathbb{R}^m$ is a random vector whose
coordinates are independent standard normal random variables. The following
concentration inequality can be found e.g. in [TjJJ inequality (1.5)].

Theorem 2.1 (Gaussian concentration). Let $f : \mathbb{R}^m \to \mathbb{R}$ be a Lipschitz
function. Let $g$ be a standard Gaussian random vector in $\mathbb{R}^m$. Then for every
$t > 0$ one has

$\mathbb{P}(f(g) - \mathbb{E}f(g) > t) \le \exp(-c_0 t^2/\|f\|_{Lip}^2)$

where $c_0 \in (0, 1)$ is an absolute constant.

As a very restrictive but useful example, Theorem 2.1 implies the following
deviation inequality for sums of independent exponential random variables
$g_i^2$ (which can also be derived by the more standard approach via moment
generating functions).

Corollary 2.2 (Sums of exponential random variables). Let $d = (d_1, \dots, d_m)$
be a vector of real numbers, and let $g_1, \dots, g_m$ be independent standard normal
random variables. Then, for every $t > 0$ we have

$\mathbb{P}\Big(\big(\sum_{i=1}^m d_i^2 g_i^2\big)^{1/2} > \|d\|_2 + t\Big) \le \exp(-c_0 t^2/\|d\|_\infty^2)$.

Proof. The function $f(y) = (\sum_{i=1}^m d_i^2 y_i^2)^{1/2}$ is a Lipschitz function on $\mathbb{R}^m$ with
$\|f\|_{Lip} = \|d\|_\infty$. Moreover, Hölder's inequality implies that

$\mathbb{E}f(g) = \mathbb{E}\big(\sum_{i=1}^m d_i^2 g_i^2\big)^{1/2} \le \big(\mathbb{E}\sum_{i=1}^m d_i^2 g_i^2\big)^{1/2} = \|d\|_2$.

Theorem 2.1 completes the proof. □

Another classical deviation inequality we will need is Bennett's inequality,
see e.g. [9, Theorem 2]:

Theorem 2.3 (Bennett's inequality). Let $X_1, \dots, X_N$ be independent mean
zero random variables such that $|X_i| \le 1$ for all $i$. Consider the sum
$S = X_1 + \cdots + X_N$ and let $\sigma^2 := \mathrm{Var}(S)$. Then, for every $t > 0$ we have

$\mathbb{P}(S > t) \le \exp\big(-\sigma^2 h(t/\sigma^2)\big)$

where $h(u) = (1 + u)\log(1 + u) - u$.
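
To see Bennett's inequality at work in the simplest case, one can take the $X_i$ to be independent random signs, for which $\sigma^2 = N$ and the tail of $S$ is an explicit binomial sum. The sketch below compares the exact tail with the bound of Theorem 2.3; the choice of sign variables, the value $N = 30$ and the function names are assumptions made for this illustration.

```python
import math

def bennett_bound(t, variance):
    """Right hand side of Bennett's inequality: exp(-sigma^2 * h(t / sigma^2))."""
    u = t / variance
    h = (1.0 + u) * math.log(1.0 + u) - u
    return math.exp(-variance * h)

def sign_sum_tail(N, t):
    """Exact P(S > t) for S = sum of N independent random signs.
    S = 2k - N, where k is the number of +1 signs, Binomial(N, 1/2)."""
    favourable = sum(math.comb(N, k) for k in range(N + 1) if 2 * k - N > t)
    return favourable / 2.0 ** N

N = 30  # Var(S) = N for random signs
checks = [(t, sign_sum_tail(N, t), bennett_bound(t, float(N))) for t in range(1, N + 1)]
for t, exact, bound in checks[::5]:
    print(t, exact, bound)
```

For small $t/\sigma^2$ one has $h(u) \approx u^2/2$, so the bound is of Gaussian type $\exp(-t^2/2\sigma^2)$, while for large $t/\sigma^2$ it decays like $\exp(-t\log(t/\sigma^2))$.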

We will also need M. Talagrand's concentration inequality for convex
Lipschitz functions from [30, Theorem 6.6]; see also [18, Corollary 4.10] and the
discussion below it.

Theorem 2.4 (Concentration of Lipschitz convex functions). Let $X_1, \dots, X_m$
be independent random variables such that $|X_i| \le K$ for all $i$. Let $f : \mathbb{R}^m \to \mathbb{R}$
be a convex and 1-Lipschitz function. Then for every $t > 0$ one has

$\mathbb{P}\big(|f(X_1, \dots, X_m) - \mathbb{E}f(X_1, \dots, X_m)| > Kt\big) \le 4\exp(-t^2/4)$.

2.3. Nets. Consider a subset $U$ of a normed space $X$, and let $\varepsilon > 0$. Recall
that an $\varepsilon$-net of $U$ is a subset $\mathcal{N}$ of $U$ such that the distance from any point
of $U$ to $\mathcal{N}$ is at most $\varepsilon$. In other words, for every $x \in U$ there exists $y \in \mathcal{N}$
such that $\|x - y\|_X \le \varepsilon$.

The following estimate follows by a volumetric argument, see e.g. the proof
of Lemma 9.5 in [19].

Lemma 2.5 (Cardinality of $\varepsilon$-nets). Let $\varepsilon \in (0, 1)$. The unit Euclidean ball
$B_2^n$ and the unit Euclidean sphere $S^{n-1}$ in $\mathbb{R}^n$ both have $\varepsilon$-nets of cardinality
at most $(1 + 2/\varepsilon)^n$.

When computing norms of linear operators, $\varepsilon$-nets provide a convenient
discretization of the problem. We formalize it in the next proposition.

Proposition 2.6 (Computing norms on nets). Let $A : X \to Y$ be a linear
operator between normed spaces $X$ and $Y$, and let $\mathcal{N}$ be an $\varepsilon$-net of either the
unit sphere $S(X)$ or the unit ball $B(X)$ of $X$ for some $\varepsilon \in (0, 1)$. Then

$\|A\| \le \frac{1}{1 - \varepsilon} \sup_{x \in \mathcal{N}} \|Ax\|_Y$.

Proof. We give the proof for an $\varepsilon$-net of the unit sphere; the case of the unit
ball is similar. Every $z \in S(X)$ has the form $z = x + h$, where $x \in \mathcal{N}$ and
$\|h\|_X \le \varepsilon$. Since $\|A\| = \sup_{z \in S(X)} \|Az\|_Y$, the triangle inequality yields

$\|A\| \le \sup_{x \in \mathcal{N}} \|Ax\|_Y + \sup_{\|h\|_X \le \varepsilon} \|Ah\|_Y$.

The last term in the right hand side is bounded by $\varepsilon\|A\|$. Thus we have shown
that

$(1 - \varepsilon)\|A\| \le \sup_{x \in \mathcal{N}} \|Ax\|_Y$.

This completes the proof. □
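
Proposition 2.6 is easy to test by hand in the plane, where an $\varepsilon$-net of the unit circle can be written down explicitly. The sketch below builds a net of equally spaced points, evaluates $\sup_{x \in \mathcal{N}} \|Ax\|_2$ for a $2 \times 2$ matrix, and compares it with the exact spectral norm. The equally spaced construction, the sample matrix and the function names are illustrative assumptions, not taken from the paper.

```python
import math

def circle_net(eps):
    """An eps-net of the unit circle made of K equally spaced points.
    A point of the circle is at angle at most pi/K from the net, i.e. at
    Euclidean distance at most 2*sin(pi/(2K)), so K is chosen to make this <= eps."""
    K = math.ceil(math.pi / (2.0 * math.asin(eps / 2.0)))
    return [(math.cos(2.0 * math.pi * k / K), math.sin(2.0 * math.pi * k / K))
            for k in range(K)]

def norm_2x2(A):
    """Exact spectral norm of a 2x2 matrix from the eigenvalues of A^T A."""
    (a, b), (c, d) = A
    trace = a * a + b * b + c * c + d * d      # trace of A^T A
    det = a * d - b * c                        # det(A^T A) = det^2
    disc = max(trace * trace - 4.0 * det * det, 0.0)
    return math.sqrt((trace + math.sqrt(disc)) / 2.0)

def sup_on_net(A, net):
    """sup of ||A x||_2 over the points x of the net."""
    (a, b), (c, d) = A
    return max(math.hypot(a * x + b * y, c * x + d * y) for x, y in net)

eps = 0.5
A = [[1.0, 2.0], [3.0, 4.0]]
net = circle_net(eps)
net_sup = sup_on_net(A, net)
exact = norm_2x2(A)
print(len(net), net_sup, exact, net_sup / (1.0 - eps))
```

The net found this way is much smaller than the volumetric bound $(1 + 2/\varepsilon)^2$ of Lemma 2.5, which is not tight in low dimensions.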



2.4. Symmetrization. We will use the standard symmetrization technique
as was done in [17]; see more general inequalities in e.g. [19, Section 6.1]. To
this end, let the matrices $A = (a_{ij})$ and $B$ be as in Theorem 1.1. Let $A' = (a'_{ij})$
be an independent copy of $A$, and let $\varepsilon_{ij}$ be independent symmetric Bernoulli
random variables. Then, by Jensen's inequality,

$\mathbb{E}\|BA\| = \mathbb{E}\|B(A - \mathbb{E}A')\| \le \mathbb{E}\|B(A - A')\|$
$= \mathbb{E}\|B(\varepsilon_{ij}(a_{ij} - a'_{ij}))\| \le 2\,\mathbb{E}\|B(\varepsilon_{ij}a_{ij})\|$.

Therefore, we can assume without loss of generality in Theorem 1.1 that $a_{ij}$
are symmetric random variables. Furthermore, let $g_{ij}$ be independent standard
normal random variables. Then, again by Jensen's inequality,

$\mathbb{E}\|B(g_{ij}a_{ij})\| = \mathbb{E}\|B(\varepsilon_{ij}|g_{ij}|a_{ij})\| \ge \mathbb{E}\|B(\varepsilon_{ij}\,\mathbb{E}(|g_{ij}|)\,a_{ij})\|$
$= (2/\pi)^{1/2}\,\mathbb{E}\|B(\varepsilon_{ij}a_{ij})\|$.

Therefore

(2.1) $\mathbb{E}\|BA\| \le (2\pi)^{1/2}\,\mathbb{E}\|B(g_{ij}a_{ij})\|$.

Conditioning on $a_{ij}$, we thus reduce the problem to random Gaussian matrices.

We will use a similar symmetrization technique several times in our
argument. In particular, in the proof of Lemma 3.8 we apply the following
observation, which can be deduced from the standard symmetrization lemma ([19,
Lemma 6.3]) and the contraction principle ([19, Theorem 4.4]). For the reader's
convenience we include a direct proof.

Lemma 2.7 (Symmetrization). Consider independent mean zero random
variables $Z_{ij}$ such that $|Z_{ij}| \le 1$, independent symmetric Bernoulli random
variables $\varepsilon_{ij}$, and vectors $x_{ij}$ in some Banach space, where both $i$ and $j$ range in
some finite index sets. Then

$\mathbb{E}\max_j \big\|\sum_i Z_{ij}x_{ij}\big\| \le 2\,\mathbb{E}\max_j \big\|\sum_i \varepsilon_{ij}x_{ij}\big\|$.

Proof. To be specific, we can assume that both indices $i$ and $j$ range in the
interval $\{1, \dots, n\}$ for some integer $n$. Let $(Z'_{ij})$ denote an independent copy of
the sequence of random variables $(Z_{ij})$. Then $Z_{ij} - Z'_{ij}$ are symmetric random
variables. We have

$\mathbb{E}\max_j \big\|\sum_i Z_{ij}x_{ij}\big\| = \mathbb{E}\max_j \big\|\sum_i (Z_{ij} - \mathbb{E}Z'_{ij})x_{ij}\big\|$   (since $\mathbb{E}Z'_{ij} = 0$)
$\le \mathbb{E}\max_j \big\|\sum_i (Z_{ij} - Z'_{ij})x_{ij}\big\|$   (by Jensen's inequality)
$= \mathbb{E}\max_j \big\|\sum_i \varepsilon_{ij}(Z_{ij} - Z'_{ij})x_{ij}\big\|$   (by symmetry)
$\le 2\max_{|a_{ij}| \le 1} \mathbb{E}\max_j \big\|\sum_i \varepsilon_{ij}a_{ij}x_{ij}\big\|$

where the last line follows because $|Z_{ij} - Z'_{ij}| \le |Z_{ij}| + |Z'_{ij}| \le 2$. The function

$(a_{ij})_{i,j=1}^n \mapsto \mathbb{E}\max_j \big\|\sum_i \varepsilon_{ij}a_{ij}x_{ij}\big\|$

is a convex function. Therefore, on the compact convex set $[-1, 1]^{n^2}$ it attains
its maximum on the extreme points, where all $a_{ij} = \pm 1$. By symmetry, the
function takes the same value at each extreme point, which equals

$\mathbb{E}\max_j \big\|\sum_i \varepsilon_{ij}x_{ij}\big\|$.

This completes the proof. □



2.5. Truncation and conditioning. We will need some elementary obser- 
vations related to truncation and conditioning of random variables. 

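Lemma 2.8 (Truncation). Let $X$ be a non-negative random variable, and let
$M > 0$, $p \ge 1$. Then

$\mathbb{E}X\mathbf{1}_{\{X > M\}} \le \frac{\mathbb{E}X^p}{M^{p-1}}$.

Proof. Indeed,

$\mathbb{E}X\mathbf{1}_{\{X > M\}} \le \mathbb{E}X(X/M)^{p-1}\mathbf{1}_{\{X > M\}} \le \mathbb{E}X^p/M^{p-1}$.

The Lemma is proved. □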
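
Since the proof of Lemma 2.8 is a pointwise comparison, the inequality holds for every distribution of a non-negative variable, including the empirical distribution of a finite sample. The sketch below checks it on such a sample; the exponential sample, its size, the seed and the tested values of $M$ and $p$ are arbitrary choices made for the illustration.

```python
import random

def truncation_sides(sample, M, p):
    """Return (E[X 1{X > M}], E[X^p] / M^(p-1)) for the empirical
    distribution of a sample of non-negative numbers."""
    count = len(sample)
    lhs = sum(x for x in sample if x > M) / count
    rhs = sum(x ** p for x in sample) / count / M ** (p - 1)
    return lhs, rhs

random.seed(1)
sample = [random.expovariate(1.0) for _ in range(10000)]
results = {(M, p): truncation_sides(sample, M, p)
           for M in (0.5, 1.0, 2.0, 5.0) for p in (1, 2, 4.5)}
for key in sorted(results):
    print(key, results[key])
```

Larger values of $p$ give a smaller right hand side once $M$ is large, which is how higher moment assumptions translate into better control of the truncated part.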



We will also need two elementary conditioning lemmas. In Section 4 we will
need to control the maximal magnitude of the entries $M_0 = \max_{i,j} |a_{ij}|$ of the
random matrix $A$. Conditioning on $M_0$ will unfortunately destroy the
independence of the entries. So, we will instead condition on an event $\{M_0 \le t\}$ for
fixed $t$, which will clearly preserve the independence. This conditional
argument used in the proof of Corollary 4.11 relies on the following two elementary
lemmas.

Lemma 2.9. Let $X$ be a random variable and $K$ be a real number. Then

$\mathbb{E}(X \mid X \le K) \le \mathbb{E}X$.

Proof. By the law of total probability,

$\mathbb{E}X = \mathbb{E}(X \mid X \le K)\,\mathbb{P}(X \le K) + \mathbb{E}(X \mid X > K)\,\mathbb{P}(X > K)$.

Thus $\mathbb{E}X$ is a convex combination of the numbers $a = \mathbb{E}(X \mid X \le K)$ and
$b = \mathbb{E}(X \mid X > K)$. Since clearly $a \le K \le b$, we must have $a \le \mathbb{E}X \le b$. □

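Lemma 2.10. Let $X$, $Y$ be non-negative random variables. Assume there
exist $K, L > 0$ such that one has for every $t \ge 1$:

(2.2) $\mathbb{E}(X^2 \mid Y \le t) \le K^2 t, \quad \mathbb{P}(Y > Lt) \le \frac{1}{t^2}$.

Then $\mathbb{E}X \le CK\sqrt{L}$.

Proof. Without loss of generality we can assume that $K = 1$ by rescaling $X$
to $X/K$. Thus we have for every $t \ge 1$:

(2.3) $\mathbb{E}X^2\mathbf{1}_{\{Y \le t\}} \le \mathbb{E}(X^2 \mid Y \le t) \le t$.

We consider the decomposition

$\mathbb{E}X = \mathbb{E}X\mathbf{1}_{\{Y \le L\}} + \sum_{k=1}^\infty \mathbb{E}X\mathbf{1}_{\{2^{k-1}L < Y \le 2^k L\}}$.

By (2.3) and Hölder's inequality, the first term is bounded as

$\mathbb{E}X\mathbf{1}_{\{Y \le L\}} \le (\mathbb{E}X^2\mathbf{1}_{\{Y \le L\}})^{1/2} \le \sqrt{L}$.

Further terms can be estimated by the Cauchy-Schwarz inequality, using (2.3)
and the second inequality in (2.2). Indeed,

$\mathbb{E}X\mathbf{1}_{\{2^{k-1}L < Y \le 2^k L\}} = \mathbb{E}X\mathbf{1}_{\{Y \le 2^k L\}}\mathbf{1}_{\{Y > 2^{k-1}L\}}$
$\le (\mathbb{E}X^2\mathbf{1}_{\{Y \le 2^k L\}})^{1/2}\,(\mathbb{P}\{Y > 2^{k-1}L\})^{1/2}$
$\le (2^k L)^{1/2} \cdot \frac{1}{2^{k-1}} = \sqrt{L}\, 2^{1-k/2}$.

Therefore

$\mathbb{E}X \le \sqrt{L} + \sum_{k=1}^\infty \sqrt{L}\, 2^{1-k/2} \le C\sqrt{L}$.

This completes the proof. □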
2.6. On the deterministic matrix B in Theorem 1.1. We start with two
initial observations that will make our proof of Theorem 1.1 more transparent.
By adding an appropriate number of zero rows to $B$ or zero columns to $A$ we
can assume without loss of generality that $n = m$, thus $B$ is an $n \times N$ matrix.

Throughout the proof of Theorem 1.1 we shall denote the columns of such
a matrix $B$ by $B_1, \dots, B_N$. They are non-random vectors in $\mathbb{R}^n$, which satisfy

(2.4) $\max_i \|B_i\|_2 \le \|B\| \le 1; \quad \sum_{i=1}^N \|B_i\|_2^2 = \|B\|_{HS}^2 \le n\|B\|^2 \le n$

where $\|\cdot\|_{HS}$ denotes the Hilbert-Schmidt norm. Throughout the argument, we
will only have access to the matrix $B$ through inequalities (2.4). This explains
Remark 2 following Theorem 1.1, which states that the range space of $B$ is
irrelevant as long as we control the spectral and Hilbert-Schmidt norms of $B$.



3. Approach via M. Rudelson's theorem 

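3.1. M. Rudelson's theorem. Our first approach, which will yield Theorem 1.1
up to a logarithmic factor, rests on the following result. Here and
thereafter, by $\varepsilon_1, \varepsilon_2, \dots$ we denote independent symmetric Bernoulli random
variables, i.e. independent random variables such that $\mathbb{P}(\varepsilon_i = 1) = \mathbb{P}(\varepsilon_i = -1) = 1/2$.

Theorem 3.1 (M. Rudelson [22]). Let $u_1, \dots, u_M$ be vectors in $\mathbb{R}^m$. Then, for
every $p \ge 1$, one has

$\Big(\mathbb{E}\big\|\sum_{i=1}^M \varepsilon_i u_i \otimes u_i\big\|^p\Big)^{1/p} \le C(\sqrt{p} + \sqrt{\log m}) \cdot \max_i \|u_i\|_2 \cdot \big\|\sum_{i=1}^M u_i \otimes u_i\big\|^{1/2}$.

In particular, for every $t > 0$, with probability at least $1 - 2me^{-ct^2}$ one has

$\big\|\sum_{i=1}^M \varepsilon_i u_i \otimes u_i\big\| \le t \cdot \max_i \|u_i\|_2 \cdot \big\|\sum_{i=1}^M u_i \otimes u_i\big\|^{1/2}$.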
The first estimate is taken from [22, inequality (3.4)]. The second estimate
can be easily derived from it using the following elementary lemma:

Lemma 3.2 (Moments and tails). Suppose a non-negative random variable $X$
satisfies for some $m \ge 1$ that

$(\mathbb{E}X^p)^{1/p} \le \sqrt{p} + \sqrt{\log m}$ for every $p \ge 1$.

Then

$\mathbb{P}(X > t) \le 2me^{-ct^2}$ for every $t > 0$.

Proof. Suppose first that $t \ge \max(1, \sqrt{\log m})$. Let $p := t^2$. Then $\sqrt{p} \ge
\sqrt{\log m}$, so the hypothesis gives $(\mathbb{E}X^p)^{1/p} \le 2\sqrt{p}$. By Markov's inequality,

$\mathbb{P}(X > 2et) = \mathbb{P}(X^p > (2et)^p) \le \frac{(2\sqrt{p})^p}{(2et)^p} = e^{-t^2}$.

Therefore, for every $t > 0$ one has

(3.1) $\mathbb{P}(X > 2et) \le 2me^{-t^2/2}$

because if $t < \max(1, \sqrt{\log m})$ then the right hand side of (3.1) is larger than
one, which makes the inequality trivial. The conclusion of the lemma follows
with a sufficiently small absolute constant $c > 0$. This completes the proof. □

The next lemma is a consequence of M. Rudelson's Theorem 3.1 and a
standard symmetrization argument.

Lemma 3.3. Let $X_1, \dots, X_n$ be independent random vectors in $\mathbb{R}^m$ such that

(3.2) $\|\mathbb{E}X_j \otimes X_j\| \le 1$ for every $j$.

Then

$\mathbb{E}\big\|\sum_{j=1}^n X_j \otimes X_j\big\| \le Cn + C\log(2m)\,\mathbb{E}\max_j \|X_j\|_2^2$.

Proof. Let $\varepsilon_1, \dots, \varepsilon_n$ be independent symmetric Bernoulli random variables.
By the triangle inequality, the standard symmetrization argument (see e.g.
[19, Lemma 6.3]), and the assumption, we have

$E := \mathbb{E}\big\|\sum_{j=1}^n X_j \otimes X_j\big\| \le \mathbb{E}\big\|\sum_{j=1}^n (X_j \otimes X_j - \mathbb{E}X_j \otimes X_j)\big\| + \big\|\sum_{j=1}^n \mathbb{E}X_j \otimes X_j\big\|$
$\le 2\,\mathbb{E}\big\|\sum_{j=1}^n \varepsilon_j X_j \otimes X_j\big\| + n$.

Condition on the random variables $X_1, \dots, X_n$, and apply Theorem 3.1. Writing
$\mathbb{E}_\varepsilon$ to denote the conditional expectation (i.e. the expectation with respect
to the random variables $\varepsilon_1, \dots, \varepsilon_n$), we have

$\mathbb{E}_\varepsilon\big\|\sum_{j=1}^n \varepsilon_j X_j \otimes X_j\big\| \le C\sqrt{\log(2m)} \cdot \max_j \|X_j\|_2 \cdot \big\|\sum_{j=1}^n X_j \otimes X_j\big\|^{1/2}$.

Now we take expectation with respect to $X_1, \dots, X_n$ and use the Cauchy-Schwarz
inequality to get

$E \le C\sqrt{\log(2m)} \cdot (\mathbb{E}\max_j \|X_j\|_2^2)^{1/2} \cdot E^{1/2} + n$.

The conclusion of the lemma follows. □
3.2. Theorem 1.1 up to a logarithmic term. We now state a version of
Theorem 1.1 with a logarithmic factor.

Proposition 3.4. Let $N$, $n$ be positive integers. Consider an $N \times n$ random
matrix $A$ whose entries are independent random variables with mean zero and
4-th moment bounded by 1. Let $B$ be an $n \times N$ matrix such that $\|B\| \le 1$.
Then

$\mathbb{E}\|BA\| \le C\sqrt{n\log(2n)}$.

The proof will need two auxiliary lemmas. Recall that $B_1, \dots, B_N$ denote
the columns of the matrix $B$.

Lemma 3.5. Let ai, . . . , ajv be independent random variables with mean zero 
and 4-th moment bounded by 1. Consider the random vector X in W 1 defined 
as 

N 



X = Y,^Bi- 



i=i 

Then 

E||X||^<n, Var(||X||l) < 3n. 
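Proof. The estimate on the expectation follows easily from (2.4):

$$\mathbb{E}\|X\|_2^2 = \sum_{i=1}^N \mathbb{E}(a_i^2)\|B_i\|_2^2 \le \sum_{i=1}^N \|B_i\|_2^2 \le n. \tag{3.3}$$

To estimate the variance, we need to compute

$$\mathbb{E}\|X\|_2^4 = \mathbb{E}\langle X, X\rangle^2 = \sum_{i,j,k,l=1}^N \mathbb{E}(a_i a_j a_k a_l)\langle B_i, B_j\rangle\langle B_k, B_l\rangle.$$

By independence and the mean zero assumption, the only nonzero terms in
this sum are those for which $i = j$, $k = l$ or $i = k$, $j = l$ or $i = l$, $j = k$.
Therefore

$$\begin{aligned}
\mathbb{E}\|X\|_2^4 &\le \sum_{i,j=1}^N \mathbb{E}(a_i^2 a_j^2)\|B_i\|_2^2\|B_j\|_2^2 + 2\sum_{i,j=1}^N \mathbb{E}(a_i^2 a_j^2)\langle B_i, B_j\rangle^2 \\
&= \sum_{i=1}^N \mathbb{E}(a_i^4)\|B_i\|_2^4 + \sum_{i \ne j} \mathbb{E}(a_i^2)\mathbb{E}(a_j^2)\|B_i\|_2^2\|B_j\|_2^2 + 2\sum_{i,j=1}^N \mathbb{E}(a_i^2 a_j^2)\langle B_i, B_j\rangle^2 \\
&=: I_1 + I_2 + I_3.
\end{aligned}$$

By the fourth moment assumption and using (2.4) we have

$$I_1 \le \sum_{i=1}^N \|B_i\|_2^4 \le \max_i \big(\|B_i\|_2^2\big) \sum_{i=1}^N \|B_i\|_2^2 \le n.$$

Squaring the sum in (3.3), we see that

$$I_2 \le (\mathbb{E}\|X\|_2^2)^2.$$

Finally, since by the Cauchy-Schwarz inequality $\mathbb{E}(a_i^2 a_j^2) \le \sqrt{\mathbb{E}(a_i^4)\mathbb{E}(a_j^4)} \le 1$, and
using (2.4) again, we obtain

$$I_3 \le 2\sum_{i,j=1}^N \langle B_i, B_j\rangle^2 = 2\|B^*B\|_{HS}^2 \le 2\|B^*\|^2\|B\|_{HS}^2 = 2\|B\|^2\|B\|_{HS}^2 \le 2n.$$

Putting all this together, we obtain

$$\mathrm{Var}(\|X\|_2^2) = \mathbb{E}\|X\|_2^4 - (\mathbb{E}\|X\|_2^2)^2 \le I_1 + I_3 \le 3n.$$

This completes the proof. □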
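
The two estimates of Lemma 3.5 can be checked exactly in a small case. The sketch below assumes Rademacher coefficients $a_i$ (one admissible distribution, chosen here for convenience) and enumerates all $2^N$ sign patterns, so the mean and variance it returns are exact rather than sampled. The helper name `exact_moments` and the sizes are hypothetical.

```python
import itertools
import numpy as np

def exact_moments(B):
    """Exact mean and variance of ||X||_2^2 for X = sum_i a_i B_i, a_i Rademacher."""
    n, N = B.shape
    values = []
    for signs in itertools.product([-1.0, 1.0], repeat=N):
        X = B @ np.array(signs)
        values.append(X @ X)
    values = np.array(values)
    # All sign patterns are equally likely, so plain averages are exact.
    return values.mean(), values.var()

rng = np.random.default_rng(1)
n, N = 3, 8
B = rng.standard_normal((n, N))
B /= np.linalg.norm(B, 2)            # normalize so that ||B|| = 1

mean, var = exact_moments(B)
print("E||X||^2   =", mean, " (sum of squared column norms:", np.sum(B**2), ")")
print("Var||X||^2 =", var, " (bound 3n =", 3 * n, ")")
```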

Lemma 3.6. Let $A$ and $B$ be matrices as in Proposition 3.4. Let $X_1, \ldots, X_n \in
\mathbb{R}^n$ denote the columns of the matrix $BA$. Then

$$\mathbb{E}\max_{j=1,\ldots,n} \|X_j\|_2^2 \le Cn.$$

Remark. This result says that all columns of the matrix $BA$ have norm $O(\sqrt{n})$
with high probability. Since the spectral norm of a matrix is bounded below by
the norm of any column, this result is a necessary step in proving our desired
estimate $\|BA\| = O(\sqrt{n})$.

Proof. Let, as usual, $B_1, \ldots, B_N \in \mathbb{R}^n$ denote the columns of the matrix $B$,
and let $a_{ij}$ denote the entries of the matrix $A$. Then

$$X_j = \sum_{i=1}^N a_{ij} B_i, \qquad j = 1, \ldots, n. \tag{3.4}$$

Let us fix $j \in \{1, \ldots, n\}$ and use Lemma 3.5. This gives

$$\mathbb{E}\|X_j\|_2^2 \le n, \qquad \mathrm{Var}(\|X_j\|_2^2) \le 3n. \tag{3.5}$$

Now we use Chebychev's inequality, which states that for a random variable
$Z$ with $\sigma^2 = \mathrm{Var}(Z)$ and for an arbitrary $k > 0$, one has

$$\mathbb{P}(|Z - \mathbb{E}Z| > k\sigma) \le \frac{1}{k^2}.$$

Let $t > 0$ be arbitrary. Using Chebychev's inequality along with (3.5) for
$Z = \|X_j\|_2^2$, $k = t\sqrt{n}$, we obtain

$$\mathbb{P}\big(\|X_j\|_2^2 > (1 + \sqrt{3}\,t)n\big) \le \frac{1}{t^2 n}.$$

Taking the union bound over all $j = 1, \ldots, n$, we conclude that

$$\mathbb{P}\Big(\max_{j=1,\ldots,n} \|X_j\|_2^2 > (1 + \sqrt{3}\,t)n\Big) \le n \cdot \frac{1}{t^2 n} = \frac{1}{t^2}.$$

Integration completes the proof. □
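
For the reader's convenience, here is one way to carry out the integration step in the proof of Lemma 3.6. This is a sketch; the shorthand $Z$ and the explicit constant it produces are not taken from the original text.

```latex
% Write Z := \max_{j} \|X_j\|_2^2 \ge 0 and substitute u = (1+\sqrt{3}\,t)n,
% so that du = \sqrt{3}\,n\,dt and u = (1+\sqrt{3})n corresponds to t = 1.
\begin{align*}
\mathbb{E} Z = \int_0^\infty \mathbb{P}(Z > u)\,du
  &\le (1+\sqrt{3})\,n + \int_{(1+\sqrt{3})n}^\infty \mathbb{P}(Z > u)\,du \\
  &= (1+\sqrt{3})\,n + \sqrt{3}\,n \int_1^\infty \mathbb{P}\big(Z > (1+\sqrt{3}\,t)n\big)\,dt \\
  &\le (1+\sqrt{3})\,n + \sqrt{3}\,n \int_1^\infty \frac{dt}{t^2}
   = (1 + 2\sqrt{3})\,n .
\end{align*}
```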

Proof of Proposition 3.4. Let $X_1, \ldots, X_n \in \mathbb{R}^n$ denote the columns of the
matrix $BA$. We are going to apply Lemma 3.3. In order to check that condition
(3.2) holds, we consider an arbitrary vector $x \in S^{n-1}$ and use representation
(3.4) to compute

$$\mathbb{E}\langle X_j, x\rangle^2 = \mathbb{E}\Big(\sum_{i=1}^N a_{ij}\langle B_i, x\rangle\Big)^2 = \sum_{i=1}^N \mathbb{E}(a_{ij}^2)\langle B_i, x\rangle^2 \le \sum_{i=1}^N \langle B_i, x\rangle^2 = \|B^*x\|_2^2 \le \|B^*\|^2 = \|B\|^2 \le 1.$$

This shows that condition (3.2) holds. Lemma 3.3 then gives

$$\mathbb{E}\|BA\|^2 = \mathbb{E}\,\Big\|\sum_{j=1}^n X_j \otimes X_j\Big\| \le Cn + C\log(2n)\,\mathbb{E}\max_{j=1,\ldots,n}\|X_j\|_2^2.$$

Estimating the maximum in the right hand side using Lemma 3.6, we conclude
that

$$\mathbb{E}\|BA\|^2 \le C_1 n\log(2n).$$

This completes the proof. □

3.3. Tradeoff between the matrix norm and the magnitude of entries.

We would now like to gain more control over the logarithmic factor than we
have in Proposition 3.4. Our next result establishes a tradeoff between the
logarithmic factor and the magnitude of the matrices $A$, $B$. It will be used in
the proof of Theorem 3.9.

Proposition 3.7. Let $a, b > 0$ and $N, n$ be positive integers. Let $A$ be an
$N \times n$ matrix whose entries are independent random variables $a_{ij}$ with mean
zero and such that

$$\mathbb{E} a_{ij}^2 \le 1, \qquad |a_{ij}| \le a \quad \text{for every } i, j.$$

Let $B$ be an $n \times N$ matrix such that $\|B\| \le 1$, and whose columns satisfy

$$\|B_i\|_2 \le b \quad \text{for every } i.$$

Then

$$\mathbb{E}\|BA\| \le C\big(1 + a\,b^{1/2}\log^{1/4}(2n)\big)\sqrt{n}.$$



The proof will again be based on M. Rudelson's Theorem 3.1, although this
time we use Rudelson's theorem in a more delicate way:

Lemma 3.8. Under the assumptions of Proposition 3.7, we have

$$\mathbb{E}\max_{j=1,\ldots,n}\Big\|\sum_{i=1}^N a_{ij}^2 B_i \otimes B_i\Big\| \le C\big(1 + a^2 b\sqrt{\log(2n)}\big).$$
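
Proof. Fix $j \in \{1, \ldots, n\}$. Let $\mu_{ij}^2 := \mathbb{E} a_{ij}^2$. By the triangle inequality,

$$\Big\|\sum_{i=1}^N a_{ij}^2 B_i \otimes B_i\Big\| \le \Big\|\sum_{i=1}^N (a_{ij}^2 - \mu_{ij}^2) B_i \otimes B_i\Big\| + \Big\|\sum_{i=1}^N \mu_{ij}^2 B_i \otimes B_i\Big\|. \tag{3.6}$$

Since $0 \le \mu_{ij}^2 \le 1$ and

$$\Big\|\sum_{i=1}^N B_i \otimes B_i\Big\| \le \|B\|^2 \le 1, \tag{3.7}$$

we have

$$\Big\|\sum_{i=1}^N \mu_{ij}^2 B_i \otimes B_i\Big\| \le \Big\|\sum_{i=1}^N B_i \otimes B_i\Big\| \le 1. \tag{3.8}$$

Next, clearly $\mu_{ij}^2 \le a^2$, so

$$\mathbb{E}(a_{ij}^2 - \mu_{ij}^2) = 0, \qquad |a_{ij}^2 - \mu_{ij}^2| \le 2a^2.$$

Symmetrization Lemma 2.7 yields

$$\mathbb{E}\max_{j=1,\ldots,n}\Big\|\sum_{i=1}^N (a_{ij}^2 - \mu_{ij}^2) B_i \otimes B_i\Big\| \le 2a^2\, \mathbb{E}\max_{j=1,\ldots,n}\Big\|\sum_{i=1}^N \varepsilon_{ij} B_i \otimes B_i\Big\| \tag{3.9}$$

where $\varepsilon_{ij}$ denote independent symmetric Bernoulli random variables.

Let $t > 0$. By the second part of M. Rudelson's Theorem 3.1 and taking the
union bound over $n$ random variables, we conclude that, with probability at
least $1 - 2n^2 e^{-ct^2}$, we have

$$\max_{j=1,\ldots,n}\Big\|\sum_{i=1}^N \varepsilon_{ij} B_i \otimes B_i\Big\| \le t \cdot \max_{i=1,\ldots,N}\|B_i\|_2 \cdot \Big\|\sum_{i=1}^N B_i \otimes B_i\Big\|^{1/2} \le tb.$$

The second estimate follows from (3.7) and since $\max_i \|B_i\|_2 \le b$ by the
hypothesis.

Let $s > 0$ be arbitrary. We apply the above estimate for $t$ chosen so that
$2n^2 e^{-ct^2} = e^{-s^2}$. This shows that, with probability at least $1 - e^{-s^2}$, one has

$$\max_{j=1,\ldots,n}\Big\|\sum_{i=1}^N \varepsilon_{ij} B_i \otimes B_i\Big\| \le tb \le C_1 b\big(\sqrt{\log(2n)} + s\big).$$

Integration implies that

$$\mathbb{E}\max_{j=1,\ldots,n}\Big\|\sum_{i=1}^N \varepsilon_{ij} B_i \otimes B_i\Big\| \le C_2 b\sqrt{\log(2n)}.$$

Putting this into (3.9) and, together with (3.8), back into (3.6), we complete
the proof. □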

Proof of Proposition 3.7. By the symmetrization argument (see (2.1)), we can
assume that the entries of the matrix $A$ are $g_{ij}a_{ij}$, where $a_{ij}$ are random
variables satisfying the assumptions of the proposition, and $g_{ij}$ are independent
standard normal random variables. We will write $\mathbb{E}_g$, $\mathbb{P}_g$ when we take
expectations and probability estimates with respect to $(g_{ij})$ (i.e. conditioned on
$(a_{ij})$), and we write $\mathbb{E}_a$ to denote the expectation with respect to $(a_{ij})$.

By Lemma 3.8, the random variable

$$K^2 := \max_{j=1,\ldots,n}\Big\|\sum_{i=1}^N a_{ij}^2 B_i \otimes B_i\Big\|,$$

which does not depend on the random variables $(g_{ij})$, has expectation

$$\mathbb{E}_a(K^2) \le C\big(1 + a^2 b\sqrt{\log(2n)}\big). \tag{3.10}$$

We condition on the random variables $(a_{ij})$; this fixes a value of $K$.
Let $X_1, \ldots, X_n \in \mathbb{R}^n$ denote the columns of the matrix $BA$; then

$$X_j = \sum_{i=1}^N g_{ij}a_{ij}B_i, \qquad j = 1, \ldots, n.$$

Consider a $(1/2)$-net $\mathcal{N}$ of the unit Euclidean sphere $S^{n-1}$ of cardinality $|\mathcal{N}| \le
5^n$, which exists by Lemma 2.5. Using Proposition 2.6, we have

$$\|BA\|^2 = \|(BA)^*\|^2 \le 4\max_{x \in \mathcal{N}}\|(BA)^*x\|_2^2 = 4\max_{x \in \mathcal{N}}\sum_{j=1}^n \langle X_j, x\rangle^2. \tag{3.11}$$

Fix $x \in \mathcal{N}$. For every $j = 1, \ldots, n$, the random variable

$$\langle X_j, x\rangle = \sum_{i=1}^N g_{ij}\langle a_{ij}B_i, x\rangle$$

is a Gaussian random variable with mean zero and variance

$$\sum_{i=1}^N \langle a_{ij}B_i, x\rangle^2 \le \Big\|\sum_{i=1}^N a_{ij}^2 B_i \otimes B_i\Big\| \le K^2.$$

(To obtain the first inequality, take the supremum over $x \in S^{n-1}$.) Therefore,
by Corollary 2.2 with $d_j = (\mathrm{Var}\langle X_j, x\rangle)^{1/2} \le K$, we have for every $t > 0$:

$$\mathbb{P}_g\Big\{\Big(\sum_{j=1}^n \langle X_j, x\rangle^2\Big)^{1/2} > K\sqrt{n} + t\Big\} \le \exp\big(-c_0 t^2/K^2\big).$$

Let $s > 0$ be arbitrary. The previous estimate for $t = sK\sqrt{n}$ gives

$$\mathbb{P}_g\Big\{\Big(\sum_{j=1}^n \langle X_j, x\rangle^2\Big)^{1/2} > (1 + s)K\sqrt{n}\Big\} \le e^{-c_0 s^2 n}.$$

Taking the union bound over $x \in \mathcal{N}$ and using (3.11), we obtain

$$\mathbb{P}_g\big\{\|BA\| > 2(1 + s)K\sqrt{n}\big\} \le |\mathcal{N}|\, e^{-c_0 s^2 n} \le 5^n e^{-c_0 s^2 n} \le e^{(2 - c_0 s^2)n}.$$

Integration yields

$$\mathbb{E}_g\|BA\| \le CK\sqrt{n}.$$

Finally, we take expectation with respect to the random variables $(a_{ij})$ and
use (3.10) to conclude that

$$\mathbb{E}\|BA\| \le C\,\mathbb{E}_a(K)\sqrt{n} \le C_1\big(1 + a^2 b\sqrt{\log(2n)}\big)^{1/2}\sqrt{n}.$$

This completes the proof. □

3.4. Theorem 1.1 for logarithmically small columns. Our next step is
to combine Propositions 3.4 and 3.7 and obtain a weaker version of the main
Theorem 1.1 - this time with the correct bound $O(\sqrt{n})$ on the norm, but under
the additional assumption that the columns of the matrix $B$ are logarithmically
small.

Theorem 3.9. Let $\varepsilon \in (0,1)$ and let $N, n$ be positive integers. Consider an
$N \times n$ random matrix $A$ whose entries are independent random variables with
mean zero and $(4 + \varepsilon)$-th moment bounded by 1. Let $B$ be an $n \times N$ matrix
such that $\|B\| \le 1$, and whose columns satisfy for some $M \ge 1$ that

\\Bi\\2 < Mlog~2~~(2n) for every i. 

Then 

$$\mathbb{E}\|BA\| \le C M^{1/2}\sqrt{n}.$$

Proof. By the symmetrization argument described in Section 2, we can assume
without loss of generality that all entries $a_{ij}$ of the matrix $A = (a_{ij})$ are
symmetric random variables. Let

a := log2?(2n). 

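We decompose every entry of the matrix $A$ according to its absolute value as

$$\bar a_{ij} := a_{ij}\mathbf{1}_{\{|a_{ij}| \le a\}}, \qquad \tilde a_{ij} := a_{ij}\mathbf{1}_{\{|a_{ij}| > a\}}.$$

Then all random variables $\bar a_{ij}$ and $\tilde a_{ij}$ have mean zero, and we have the following
decomposition of matrices:

$$BA = B\bar A + B\tilde A, \quad \text{where } \bar A = (\bar a_{ij}),\ \tilde A = (\tilde a_{ij}).$$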

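The norm of $B\tilde A$ can be bounded using Proposition 3.4. Indeed, by the
Truncation Lemma 2.8 with $p = 1 + \varepsilon/4$, we have

$$\mathbb{E}\,\tilde a_{ij}^4 = \mathbb{E}\, a_{ij}^4\mathbf{1}_{\{a_{ij}^4 > a^4\}} \le \frac{\mathbb{E}|a_{ij}|^{4+\varepsilon}}{a^{\varepsilon}} \le a^{-\varepsilon},$$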
where the last inequality follows from the moment hypothesis. Therefore, the
matrix a e A satisfies the hypothesis of Proposition 13.41 which then yields 

E||Si|| < CaT e yJn\og(2n) = Cy/n. 

The norm of BA can be bounded using Proposition 13.71 which we can apply 
with a as above and b = M\og~^~~ (2n). This gives 

$$\mathbb{E}\|B\bar A\| \le C\big(1 + a\,b^{1/2}\log^{1/4}(2n)\big)\sqrt{n} \le 2CM^{1/2}\sqrt{n},$$

where the last inequality follows by our choice of $a$ and $b$.

Putting the two estimates together, we conclude by the triangle inequality
that

$$\mathbb{E}\|BA\| \le \mathbb{E}\|B\bar A\| + \mathbb{E}\|B\tilde A\| \le C' M^{1/2}\sqrt{n}.$$

This completes the proof. □

Remark. The factor M 1//2 in the conclusion of Theorem 13.91 can easily be im- 
proved to about M £ l 2 by choosing a = tlog^(2n) in the proof and optimizing 
in t. We will not need this improvement in our argument. 

4. Approach via concentration 

In this section, we develop an alternative way to bound the norm of $BA$,
which rests on Gaussian concentration inequalities and an elaborate choice of
$\varepsilon$-nets. The main technical result of this section is the following theorem, which,
like Theorem 3.9, gives the correct bound $O(\sqrt{n})$ under some boundedness
assumptions on the entries of $A$.

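Theorem 4.1. Let $\varepsilon \in (0, 1)$, $M \ge 1$ and let $N \ge n$ be positive integers such
that $\log(2N) \le Mn$. Consider an $N \times n$ random matrix $A$ whose entries $a_{ij}$ are
independent random variables with mean zero and such that

$$\mathbb{E}|a_{ij}|^{2+\varepsilon} \le 1, \qquad |a_{ij}| \le \Big(\frac{Mn}{\log(2N)}\Big)^{\frac{1}{2+\varepsilon}} \quad \text{for every } i, j.$$

Let $B$ be an $n \times N$ matrix such that $\|B\| \le 1$. Then

$$\mathbb{E}\|BA\| \le C(\varepsilon)\sqrt{Mn},$$

where $C(\varepsilon)$ depends only on $\varepsilon$.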

Remarks. 1. If the entries $a_{ij}$ have bounded $(4 + \varepsilon)$-th moment, it is easy to
check that $\max_{i,j}|a_{ij}| \lesssim (nN)^{\frac{1}{4+\varepsilon}}$ holds with high probability. Therefore, under
the $(4+\varepsilon)$-th moment assumption, the hypotheses of Theorem 4.1 are satisfied
for almost square matrices, i.e. those for which $N \le n^{1+c\varepsilon}$. This will quickly
yield the main Theorem 1.1 for almost square matrices, see Corollary 4.11
below.

2. The hypotheses of Theorem 4.1 are almost sharp when $N \sim n$. Indeed,
let us assume for simplicity that the random variables $a_{ij}$ are identically
distributed and $B$ is the identity matrix. The $(2 + \varepsilon)$-th moment hypothesis is
almost sharp: if $\mathbb{E} a_{ij}^2 \gg 1$ then $(\mathbb{E}\|A\|^2)^{1/2} \ge \big(\frac{1}{n}\mathbb{E}\|A\|_{HS}^2\big)^{1/2} \gg \sqrt{n}$. Also, the
boundedness hypothesis is almost sharp, since $\|A\| \ge \max_{i,j}|a_{ij}|$.

3. Using M. Talagrand's concentration result, Theorem 2.4, one can also
obtain tail bounds for the norm $\|BA\|$:

Corollary 4.2. Under the assumptions of Theorem 4.1, one has for every
$t > 0$:

$$\mathbb{P}\big(\|BA\| > (C(\varepsilon) + t)\sqrt{Mn}\big) \le 4e^{-t^2/4}.$$

In particular, one has for every $q \ge 1$:

$$(\mathbb{E}\|BA\|^q)^{1/q} \le C_0(\varepsilon)\sqrt{qMn}.$$

Proof. We can consider the $N \times n$ matrix $A$ as a vector in $\mathbb{R}^{Nn}$. The Euclidean
norm of such a vector equals the Hilbert-Schmidt norm $\|A\|_{HS}$. Since $\|BA\| \le
\|B\|\|A\| \le 1 \cdot \|A\|_{HS}$, the function $f : \mathbb{R}^{Nn} \to \mathbb{R}$ defined by $f(A) = \|BA\|$
is 1-Lipschitz and convex. Since we have $|a_{ij}| \le \sqrt{Mn}$ for all $i, j$ by the
assumptions, M. Talagrand's Theorem 2.4 gives

$$\mathbb{P}\big(\|BA\| - \mathbb{E}\|BA\| > t\sqrt{Mn}\big) \le 4e^{-t^2/4}, \qquad t > 0.$$

The estimate for $\mathbb{E}\|BA\|$ in Theorem 4.1 completes the proof. □
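
The Lipschitz property used in the proof of Corollary 4.2 can be sanity-checked numerically. The sketch below is an illustration under assumed sizes and Gaussian test matrices (neither comes from the paper): it compares $|f(A_1) - f(A_2)|$ with the Hilbert-Schmidt distance $\|A_1 - A_2\|_{HS}$ for $f(A) = \|BA\|$ and $\|B\| = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 30, 10
B = rng.standard_normal((n, N))
B /= np.linalg.norm(B, 2)                  # enforce ||B|| = 1

def f(A):
    """Spectral norm of BA, viewed as a function of the N*n entries of A."""
    return np.linalg.norm(B @ A, 2)

worst = 0.0
for _ in range(200):
    A1 = rng.standard_normal((N, n))
    A2 = rng.standard_normal((N, n))
    # The Euclidean distance in R^{Nn} is the Hilbert-Schmidt (Frobenius) norm.
    dist = np.linalg.norm(A1 - A2, "fro")
    worst = max(worst, abs(f(A1) - f(A2)) / dist)

print(f"largest observed |f(A1) - f(A2)| / ||A1 - A2||_HS = {worst:.3f}")
```

The observed ratio never exceeds 1, as the chain $\|BA\| \le \|B\|\|A\| \le \|A\|_{HS}$ predicts.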

4.1. Sparse matrices: rows and columns. Theorem 4.1 will follow from
our analysis of sparse matrices. We will decompose the entries $a_{ij}$ according
to their magnitude. As the magnitude increases, the moment assumptions will
ensure that there will be fewer such entries, i.e. the resulting matrix becomes 
sparser. 

We start with an elementary lemma, which will help us analyze the magni- 
tude of the rows and columns of the matrix BA when A is a sparse matrix. 

Lemma 4.3. Let $N, n$ be positive integers. Consider independent random
variables $a_{ij}$, $i = 1, \ldots, N$, $j = 1, \ldots, n$. Let $p \in (0, 1]$, and suppose that

$$\mathbb{E} a_{ij}^2 \le p, \qquad |a_{ij}| \le 1 \quad \text{for every } i, j.$$

Let $B$ be an $n \times N$ matrix such that $\|B\| \le 1$, whose columns are denoted $B_i$.
Then

$$\mathbb{E}\max_{i=1,\ldots,N}\sum_{j=1}^n a_{ij}^2 \le C(np + \log(2N)), \tag{4.1}$$

$$\mathbb{E}\max_{j=1,\ldots,n}\sum_{i=1}^N a_{ij}^2\|B_i\|_2^2 \le C(np + \log(2n)). \tag{4.2}$$

Remark. The test case for this lemma, as well as for most of the results that
follow, is the random variables $a_{ij}$ with values in $\{-1, 0, 1\}$ and such that
$\mathbb{P}(a_{ij} \ne 0) = p$. The $N \times n$ random matrix $A = (a_{ij})$ will then become sparser
as we decrease $p$; it will have on average $np$ nonzero entries per row. Estimate
(4.1) gives a bound on the Euclidean norm of all rows of $A$.
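
The test case described in the remark can be generated directly. The following is a minimal sketch; the helper name `sparse_sign_matrix`, the sizes and the seed are assumptions made for this illustration. It prints the average number of nonzero entries per row next to $np$, and the largest squared row norm next to $np + \log(2N)$, the quantity appearing in (4.1).

```python
import numpy as np

def sparse_sign_matrix(N, n, p, rng):
    """N x n matrix with i.i.d. entries in {-1, 0, 1} and P(entry != 0) = p."""
    signs = rng.choice([-1.0, 1.0], size=(N, n))
    mask = rng.random((N, n)) < p
    return signs * mask

rng = np.random.default_rng(0)
N, n, p = 2000, 400, 0.05
A = sparse_sign_matrix(N, n, p, rng)

avg_nonzeros_per_row = np.count_nonzero(A, axis=1).mean()
max_row_norm_sq = (A**2).sum(axis=1).max()     # left-hand side of (4.1) for this draw
print("average nonzeros per row:", avg_nonzeros_per_row, " np =", n * p)
print("max squared row norm:", max_row_norm_sq, " np + log(2N) =", n * p + np.log(2 * N))
```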

Proof. We will only prove inequality (4.2); the proof of inequality (4.1) is
similar. By the assumptions, we have

$$\mathrm{Var}(a_{ij}^2) \le \mathbb{E} a_{ij}^4 \le \mathbb{E} a_{ij}^2 \le p \quad \text{for every } i, j.$$

Also, recall that (2.4) gives

$$\sum_{i=1}^N \|B_i\|_2^2 \le n, \qquad \sum_{i=1}^N \|B_i\|_2^4 \le \max_i \|B_i\|_2^2 \cdot \sum_{i=1}^N \|B_i\|_2^2 \le n.$$

Consider the sums of independent random variables

$$S_j := \sum_{i=1}^N a_{ij}^2\|B_i\|_2^2, \qquad j = 1, \ldots, n.$$

The above estimates show that for every $j$ we have

$$\mathbb{E} S_j = \sum_{i=1}^N \mathbb{E}(a_{ij}^2)\|B_i\|_2^2 \le np, \qquad \mathrm{Var}(S_j) = \sum_{i=1}^N \mathrm{Var}(a_{ij}^2)\|B_i\|_2^4 \le np.$$

We apply Bennett's inequality, Theorem 2.3, for $X_i = (a_{ij}^2 - \mathbb{E} a_{ij}^2)\|B_i\|_2^2$,
which clearly satisfy $|X_i| \le 1$ because $|a_{ij}| \le 1$ and $\|B_i\|_2 \le 1$ by (2.4). We
obtain

$$\mathbb{P}\{S_j - \mathbb{E} S_j > t\} \le \exp\big(-\sigma^2 h(t/\sigma^2)\big) \tag{4.3}$$

where $\mathbb{E}(S_j) \le np$ and $\sigma^2 = \mathrm{Var}(S_j) \le np$. Note that $h(x) \ge cx$ for
$x \ge 1$, where $c$ is some positive absolute constant. Therefore, if $t > np$, then
$\sigma^2 h(t/\sigma^2) \ge ct$, so (4.3) yields

$$\mathbb{P}\{S_j > 2t\} \le e^{-ct} \quad \text{for } t > np.$$

Taking the union bound over all $j$, we conclude that

$$\mathbb{P}\Big\{\max_{j=1,\ldots,n} S_j > 2t\Big\} \le ne^{-ct} \quad \text{for } t > np.$$

Now let $s \ge 1$ be arbitrary, and use the last inequality for $t = (np + \log(2n))s$.
We obtain

$$\mathbb{P}\Big\{\max_{j=1,\ldots,n} S_j > 2(np + \log(2n))s\Big\} \le ne^{-c\log(2n)s} = 2^{-cs}n^{1-cs}.$$

Integration yields

$$\mathbb{E}\max_{j=1,\ldots,n} S_j \le C(np + \log(2n)).$$

This completes the proof of (4.2). □



The estimates in Lemma 4.3 motivate us to consider the class of $N \times n$
matrices $A = (a_{ij})$ whose entries satisfy the following inequalities for some
parameters $p \in (0, 1]$ and $K \ge 1$:

$$\begin{aligned}
&\max_{i,j}|a_{ij}| \le 1; \\
&\max_{i=1,\ldots,N}\Big(\sum_{j=1}^n a_{ij}^2\Big)^{1/2} \le K\sqrt{np + \log(2N)}; \\
&\max_{j=1,\ldots,n}\Big(\sum_{i=1}^N a_{ij}^2\|B_i\|_2^2\Big)^{1/2} \le K\sqrt{np + \log(2n)}.
\end{aligned} \tag{4.4}$$

We have proved that for random matrices whose entries satisfy $|a_{ij}| \le 1$ and
$\mathbb{E} a_{ij}^2 \le p$, conditions (4.4) hold with a random parameter $K$ that satisfies
$\mathbb{E} K \le C$.
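
For a concrete array of numbers $a_{ij}$ and a concrete $B$, the smallest admissible parameter $K$ in (4.4) can be computed directly. The sketch below is illustrative: the function name `smallest_K` and the test data are assumptions, and it reads (4.4) as a bound on the entries, on the Euclidean norms of the rows, and on the column norms weighted by $\|B_i\|_2$.

```python
import numpy as np

def smallest_K(a, B, p):
    """Smallest K >= 1 for which the last two conditions in (4.4) hold.

    a : (N, n) array of numbers with |a_ij| <= 1
    B : (n, N) array with spectral norm at most 1
    """
    N, n = a.shape
    if np.abs(a).max() > 1:
        raise ValueError("first condition in (4.4) fails: some |a_ij| > 1")
    col_norms_sq = (B**2).sum(axis=0)                    # ||B_i||_2^2 for i = 1..N
    row_term = np.sqrt((a**2).sum(axis=1).max())         # max_i (sum_j a_ij^2)^(1/2)
    col_term = np.sqrt((a**2 * col_norms_sq[:, None]).sum(axis=0).max())
    K_rows = row_term / np.sqrt(n * p + np.log(2 * N))
    K_cols = col_term / np.sqrt(n * p + np.log(2 * n))
    return max(1.0, K_rows, K_cols)

rng = np.random.default_rng(0)
N, n, p = 300, 100, 0.1
a = rng.choice([-1.0, 1.0], size=(N, n)) * (rng.random((N, n)) < p)
Q, _ = np.linalg.qr(rng.standard_normal((N, n)))
B = Q.T                                                   # orthonormal rows, ||B|| = 1
print("K =", smallest_K(a, B, p))
```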



4.2. Concentration for a fixed vector. Our goal will be to estimate the
magnitude of $\|BA\|$ for matrices of the form $A = (g_{ij}a_{ij})$, where $g_{ij}$ are
independent standard normal random variables, and $a_{ij}$ are fixed numbers that
satisfy conditions (4.4). Such an estimate will be established in Proposition 4.8
below. By the standard symmetrization, the same estimate will hold true if
$A = (a_{ij})$ is a random matrix with entries as in Lemma 4.3. This will be done in
Corollary 4.9. Finally, Theorem 4.1 will be deduced from this by decomposing
the entries of a random matrix according to their magnitude.

Our first step toward this goal is to check the magnitude of $\|BAx\|_2$ for a
fixed vector $x$.

Lemma 4.4. Let $N, n$ be positive integers. Consider an $N \times n$ random matrix
$A = (g_{ij}a_{ij})$ where $g_{ij}$ are independent standard normal random variables and
$a_{ij}$ are numbers that satisfy conditions (4.4). Let $B$ be an $n \times N$ matrix such
that $\|B\| \le 1$. Then, for every vector $x \in B_2^n$ we have

$$\mathbb{E}\|BAx\|_2 \le K\sqrt{np + \log(2n)}.$$

Proof. Denoting as usual the columns of $B$ by $B_i$, we have

$$BAx = \sum_{i=1}^N \Big(\sum_{j=1}^n g_{ij}a_{ij}x_j\Big)B_i.$$

Since $\|x\|_2 \le 1$ and using the last condition in (4.4), we have

$$\mathbb{E}\|BAx\|_2^2 = \sum_{i=1}^N\sum_{j=1}^n a_{ij}^2 x_j^2\|B_i\|_2^2 = \sum_{j=1}^n x_j^2\sum_{i=1}^N a_{ij}^2\|B_i\|_2^2 \le \max_{j=1,\ldots,n}\sum_{i=1}^N a_{ij}^2\|B_i\|_2^2 \le K^2(np + \log(2n)).$$

This completes the proof. □

We will now strengthen Lemma 4.4 into a deviation inequality for $\|BAx\|_2$.
This is a simple consequence of the Gaussian concentration, Theorem 2.1. This
deviation inequality is universal in that it holds for any vector $x$; in the sequel
we will need more delicate inequalities that depend on the distribution of the
coordinates in $x$.

Lemma 4.5 (Universal deviation). Let $A$ and $B$ be matrices as in Lemma 4.4.
Then, for every vector $x \in B_2^n$ and every $t > 0$ we have

$$\mathbb{P}\big\{\|BAx\|_2 > K\sqrt{np + \log(2n)} + t\big\} \le e^{-c_0 t^2}. \tag{4.5}$$

Proof. As in the proof of Lemma 4.4, we write

$$BAx = \sum_{i=1}^N \Big(\sum_{j=1}^n g_{ij}a_{ij}x_j\Big)B_i$$

where $B_i$ are the columns of the matrix $B$. Therefore, the random vector $BAx$
is distributed identically with the random vector

$$\sum_{i=1}^N g_i\lambda_i B_i, \quad \text{where } \lambda_i = \Big(\sum_{j=1}^n a_{ij}^2 x_j^2\Big)^{1/2}$$

and where $g_i$ are independent standard normal random variables. Since all
$|a_{ij}| \le 1$ by conditions (4.4), and $\|x\|_2 \le 1$ by the assumptions, we have

$$0 \le \lambda_i \le 1, \qquad i = 1, \ldots, N.$$

Consider the map $f : \mathbb{R}^N \to \mathbb{R}$ given by

$$f(y) = \Big\|\sum_{i=1}^N y_i\lambda_i B_i\Big\|_2.$$

Its Lipschitz norm equals

$$\|f\|_{\mathrm{Lip}} = \Big\|\sum_{i=1}^N \lambda_i^2 B_i \otimes B_i\Big\|^{1/2} \le \max_i \lambda_i \cdot \Big\|\sum_{i=1}^N B_i \otimes B_i\Big\|^{1/2} \le 1 \cdot \|B\| \le 1.$$

Then the Gaussian concentration, Theorem 2.1, gives for every $t > 0$:

$$\mathbb{P}\big(f(g) - \mathbb{E}f(g) > t\big) \le \exp(-c_0 t^2),$$

where $g = (g_1, \ldots, g_N)$. Since as we noted above, $f(g)$ is distributed identically
with $\|BAx\|_2$, Lemma 4.4 completes the proof. □
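
The Lipschitz-norm computation in the proof of Lemma 4.5 is a statement about the matrix with columns $\lambda_i B_i$, and it can be checked numerically. The sketch below uses assumed sizes and uniformly distributed numbers $a_{ij}$ (illustrative choices only); the variable names `lam` and `T` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 40, 15
B = rng.standard_normal((n, N))
B /= np.linalg.norm(B, 2)                    # ||B|| = 1
a = rng.uniform(-1, 1, size=(N, n))          # fixed numbers with |a_ij| <= 1
x = rng.standard_normal(n)
x /= np.linalg.norm(x)                       # ||x||_2 = 1

lam = np.sqrt((a**2) @ (x**2))               # lambda_i = (sum_j a_ij^2 x_j^2)^(1/2)
T = B * lam                                  # matrix whose i-th column is lambda_i B_i
lip = np.linalg.norm(T, 2)                   # Lipschitz norm of y -> ||T y||_2
gram = np.linalg.norm(T @ T.T, 2) ** 0.5     # ||sum_i lambda_i^2 B_i (x) B_i||^(1/2)
print("||f||_Lip =", lip, " Gram form =", gram, " max lambda_i =", lam.max())
```

Both expressions for the Lipschitz norm agree, and they are bounded by $\max_i \lambda_i \le 1$, which is what the Gaussian concentration step requires.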

4.3. Control of sparse vectors. Since the spectral norm of $BA$ is the
supremum of $\|BAx\|_2$ over all $x \in S^{n-1}$, the result of Lemma 4.5 suggests that
$\mathbb{E}\|BA\| \lesssim \sqrt{np + \log N}$ should be true. However, the deviation inequality in
Lemma 4.5 is not strong enough to prove this bound. This is because the
metric entropy of the sphere, measured e.g. as the cardinality of its $\frac12$-net, is $e^{cn}$.
If we are to make the bound on $\|BAx\|_2$ uniform over the net, we would need
the probability estimate in (4.5) to be at most $e^{-cn}$ (to allow room for the union
bound over $e^{cn}$ points $x$ in the net). This however would force us to make
$t \sim \sqrt{n}$ or larger, so the best bound we can get this way is $\mathbb{E}\|BA\| \lesssim \sqrt{n}$.
This bound is too weak as it ignores the last two assumptions in (4.4).

Nevertheless, the bound in Lemma 4.5 can be made uniform over a set of
sparse vectors, whose metric entropy is smaller than that of the whole sphere:



Proposition 4.6 (Sparse vectors). Let $A$ and $B$ be matrices as in Lemma 4.4.
There exists an absolute constant $c > 0$ such that the following holds. Consider
the set of vectors

$$B_{2,0} := \big\{x \in \mathbb{R}^n,\ \|x\|_2 \le 1,\ \|x\|_0 \le cnp/\log(e/p)\big\}.$$

Then 

E sup ||.RA2;||2 < ZK^/np + log(2n). 

z£-B 2 ,o 

Proof. Let $c > 0$ be a constant to be determined later, and let $\lambda := cp/\log(e/p)$.
Then

$$B_{2,0} = \bigcup_{|J| = \lfloor\lambda n\rfloor} B_2^J$$

where the union is over all subsets $J \subseteq \{1, \ldots, n\}$ of cardinality $\lfloor\lambda n\rfloor$, and
where $B_2^J = \{x \in \mathbb{R}^J : \|x\|_2 \le 1\}$ denotes the unit Euclidean ball in $\mathbb{R}^J$. By
Lemma 2.5, $B_2^J$ has a $\frac12$-net $\mathcal{N}_J$ of cardinality at most $e^{2\lambda n}$. Let $t \ge 1$. For a
fixed $x \in \mathcal{N}_J$, Lemma 4.5 gives

$$\mathbb{P}\big\{\|BAx\|_2 > (K + 1)\sqrt{np + \log(2n)} + t\big\} \le \exp\big(-c_0(np + t^2)\big).$$

Using Proposition 2.6 and taking the union bound over all $x \in \mathcal{N}_J$, we obtain

$$\begin{aligned}
\mathbb{P}\Big\{\frac12\sup_{x \in B_2^J}\|BAx\|_2 > (K + 1)\sqrt{np + \log(2n)} + t\Big\}
&\le \mathbb{P}\Big\{\sup_{x \in \mathcal{N}_J}\|BAx\|_2 > (K + 1)\sqrt{np + \log(2n)} + t\Big\} \\
&\le |\mathcal{N}_J|\exp\big(-c_0(np + t^2)\big) \le \exp\big(2\lambda n - c_0(np + t^2)\big).
\end{aligned}$$

Since there are $\binom{n}{\lfloor\lambda n\rfloor} \le (e/\lambda)^{\lambda n}$ ways to choose the subset $J$, by taking the
union bound over all $J$ we conclude that

$$\mathbb{P}\Big\{\frac12\sup_{x \in B_{2,0}}\|BAx\|_2 > 2(K + 1)\sqrt{np + \log(2n)} + t\Big\} \le \exp\big(\lambda\log(e/\lambda)n + 2\lambda n - c_0(np + t^2)\big). \tag{4.6}$$

Finally, if the absolute constant $c > 0$ in the definition of $\lambda$ is chosen sufficiently
small, we have $\lambda\log(e/\lambda)n + 2\lambda n \le c_0 np$. Thus the right hand side of (4.6) is
at most

$$\exp(-c_0 t^2).$$

Integration completes the proof. □
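
The counting bound $\binom{n}{\lfloor\lambda n\rfloor} \le (e/\lambda)^{\lambda n}$ used for the union bound over supports can be checked on a small grid of parameters. The following sketch works with logarithms to avoid huge numbers; the helper name `log_binom_bound_gap` and the grid are assumptions made for this illustration.

```python
import math

def log_binom_bound_gap(n, lam):
    """log((e/lam)^(lam*n)) - log(C(n, floor(lam*n))); nonnegative when the bound holds."""
    k = math.floor(lam * n)
    log_binom = math.log(math.comb(n, k))
    log_bound = lam * n * math.log(math.e / lam)
    return log_bound - log_binom

gaps = {}
for n in (10, 100, 1000):
    for lam in (0.01, 0.1, 0.5, 1.0):
        gaps[(n, lam)] = log_binom_bound_gap(n, lam)
        print(n, lam, round(gaps[(n, lam)], 3))
```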

4.4. Control of spread vectors. Although we now have good control of
sparse vectors, they unfortunately comprise a small part of the unit ball $B_2^n$.
More common but harder to deal with are "spread vectors" - those having
many coordinates that are not close to zero. The next result gains control of
the spread vectors.

Proposition 4.7 (Spread vectors). Let $A$ and $B$ be matrices as in Lemma 4.4
with $N \ge n$. Let $M \ge 2$. Consider the set of vectors

$$B_{2,\infty} := \Big\{x \in \mathbb{R}^n,\ \|x\|_2 \le 1,\ \|x\|_\infty \le \frac{M}{\sqrt{n}}\Big\}.$$

Then

$$\mathbb{E}\sup_{x \in B_{2,\infty}}\|BAx\|_2 \le C\log^{3/2}(M) \cdot K\sqrt{np + \log(2N)}.$$

Proof. This time we will need to work with multiple nets to account for different
possible distributions of the magnitude of the coordinates of vectors $x \in B_{2,\infty}$.
Since $\|x\|_\infty \le \|x\|_2$, without loss of generality we can assume that $M \le \sqrt{n}$.

Step 1: construction of nets. Let

$$h_k := \frac{2^k}{\sqrt{n}}, \qquad k = -2, -1, 0, 1, 2, \ldots, \log_2 M,$$

and let

$$\mathcal{N} := \{x \in B_{2,\infty} : \forall j\ \exists k \text{ such that } |x_j| = h_k\}.$$

A standard calculation shows that $\mathcal{N}$ is a $\frac12$-net of $B_{2,\infty}$ in the $B_{2,\infty}$-norm,
i.e. for every $x \in B_{2,\infty}$ there exists $y \in \mathcal{N}$ such that $x - y \in \frac12 B_{2,\infty}$. Therefore,
by Proposition 2.6,

$$\sup_{x \in B_{2,\infty}}\|BAx\|_2 \le 2\sup_{x \in \mathcal{N}}\|BAx\|_2.$$

Fix $x \in \mathcal{N}$. Since $\|x\|_2 \le 1$, the number of coordinates of $x$ that satisfy
$|x_j| = h_k$ is at most $\lfloor h_k^{-2}\rfloor$, for every $k$. Decomposing $x$ according to the
coordinates whose absolute value is $h_k$, we have by the triangle inequality that

$$\sup_{x \in B_{2,\infty}}\|BAx\|_2 \le 2\sum_{k=-2}^{\log_2 M}\sup_{x \in \mathcal{N}_k}\|BAx\|_2, \tag{4.7}$$

where

$$\mathcal{N}_k = \{x \in \mathbb{R}^n : \|x\|_2 \le 1, \text{ all nonzero coordinates of } x \text{ satisfy } |x_j| = h_k\}.$$

Fix $k$ and assume that $\mathcal{N}_k \ne \emptyset$. Since $h_k \le M/\sqrt{n}$, we have

$$m := \lfloor h_k^{-2}\rfloor \ge \lfloor n/M^2\rfloor \ge 1. \tag{4.8}$$

To estimate the cardinality of $\mathcal{N}_k$, note that there are at most $\min(m, n)$ ways
to choose $\|x\|_0 =: l$; there are $\binom{n}{l}$ ways to choose the support of $x$; and there are
$2^l$ ways to choose the (signs of) nonzero coordinates of $x$. Hence by Stirling's
approximation and using (4.8), we have

$$|\mathcal{N}_k| \le \sum_{l=1}^{\min(m,n)}\binom{n}{l}2^l \le \min\Big\{\Big(\frac{2en}{m}\Big)^m, 4^n\Big\} \le (4eM^2)^m \le \exp(Cm\log M) \tag{4.9}$$

where $C > 1$ is an absolute constant.
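
The cardinality estimate (4.9) can be tested for small parameters. The sketch below is hedged: it assumes that the levels have the form $h_k = 2^k/\sqrt{n}$ (the displayed definition is damaged in the source, so this form is a reconstruction), and the function name `net_log_cardinality_bound` and the values of $n$ and $M$ are chosen for illustration. It compares $\log\sum_{l \le \min(m,n)}\binom{n}{l}2^l$ with $m\log(4eM^2)$.

```python
import math

def net_log_cardinality_bound(n, M, k):
    """Return (log of sum_{l<=min(m,n)} C(n,l) 2^l, m*log(4 e M^2)) for level k."""
    h_k = 2.0**k / math.sqrt(n)           # assumed form of the level h_k
    m = math.floor(h_k**-2)               # m = floor(h_k^{-2}), as in (4.8)
    count = sum(math.comb(n, l) * 2**l for l in range(1, min(m, n) + 1))
    return math.log(count), m * math.log(4 * math.e * M**2)

n, M = 64, 4                               # requires 2 <= M <= sqrt(n)
results = []
for k in range(-2, int(math.log2(M)) + 1):
    lhs, rhs = net_log_cardinality_bound(n, M, k)
    results.append((lhs, rhs))
    print(k, round(lhs, 2), round(rhs, 2))
```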

Step 2: control of a fixed vector. Fix $k$ and fix $x \in \mathcal{N}_k$. As we saw in the
proof of Lemma 4.5,

$$\|BAx\|_2 \text{ is distributed identically with } \Big\|\sum_{i=1}^N g_i\lambda_i B_i\Big\|_2$$

where

$$\lambda_i = \Big(\sum_{j=1}^n a_{ij}^2 x_j^2\Big)^{1/2}$$

and where $g_i$ are independent standard normal random variables. Since $x \in
\mathcal{N}_k$, we have $\|x\|_\infty = h_k \le \frac{1}{\sqrt{m}}$. This and the second condition in (4.4) yield

$$\lambda_i \le \frac{1}{\sqrt{m}}\Big(\sum_{j=1}^n a_{ij}^2\Big)^{1/2} \le K\sqrt{\frac{np + \log(2N)}{m}}.$$

We consider the map $f : \mathbb{R}^N \to \mathbb{R}$ given by

$$f(y) = \Big\|\sum_{i=1}^N y_i\lambda_i B_i\Big\|_2.$$

Repeating the estimate in the proof of Lemma 4.5 we bound the Lipschitz
norm as

$$\|f\|_{\mathrm{Lip}} \le \max_i \lambda_i \le K\sqrt{\frac{np + \log(2N)}{m}}.$$

Then the Gaussian concentration, Theorem 2.1, gives for every $t > 0$:

$$\mathbb{P}\big(f(g) - \mathbb{E}f(g) > t\big) \le \exp\Big(-\frac{c_0 t^2 m}{K^2(np + \log(2N))}\Big),$$

where $g = (g_1, \ldots, g_N)$. Since as we noted above, $f(g)$ is distributed identically
with $\|BAx\|_2$, Lemma 4.4 yields that

$$\mathbb{P}\big(\|BAx\|_2 > K\sqrt{np + \log(2n)} + t\big) \le \exp\Big(-\frac{c_0 t^2 m}{K^2(np + \log(2N))}\Big).$$

Let $u > 0$ be arbitrary. Applying the above estimate for $t = uK\sqrt{np + \log(2N)}$
and using $N \ge n$ we conclude that

$$\mathbb{P}\big(\|BAx\|_2 > (1 + u)K\sqrt{np + \log(2N)}\big) \le \exp(-c_0 u^2 m). \tag{4.10}$$

Step 3: union bound. Taking the union bound in (4.10) over all $x \in \mathcal{N}_k$ and
using estimate (4.9) on the cardinality of $\mathcal{N}_k$, we have for all $u > 0$:

$$\mathbb{P}\Big(\sup_{x \in \mathcal{N}_k}\|BAx\|_2 > (1 + u)K\sqrt{np + \log(2N)}\Big) \le |\mathcal{N}_k|\exp(-c_0 u^2 m) \le \exp(Cm\log M - c_0 u^2 m).$$

Let $s \ge 1$. We choose $u = C_1 s\sqrt{\log M}$, where $C_1 := \sqrt{C/c_0}$. Since $u \ge 1$ and
$m \ge 1$, $M \ge 2$, we obtain from the above estimate that

$$\mathbb{P}\Big(\sup_{x \in \mathcal{N}_k}\|BAx\|_2 > 2C_1 sK\sqrt{\log(M)(np + \log(2N))}\Big) \le \exp\big(C(1 - s^2)m\log M\big) \le \exp\big(c(1 - s^2)\big).$$

Integrating yields that

$$\mathbb{E}\sup_{x \in \mathcal{N}_k}\|BAx\|_2 \le C_2 K\sqrt{\log(M)(np + \log(2N))}.$$

Putting this back in (4.7), we conclude that

$$\mathbb{E}\sup_{x \in B_{2,\infty}}\|BAx\|_2 \le 2(3 + \log M) \cdot C_2 K\sqrt{\log(M)(np + \log(2N))}.$$

This completes the proof. □

4.5. Norms of sparse matrices, and proof of Theorem 4.1. Propositions
4.6 and 4.7 together handle all vectors in the unit ball, and yield the
following norm estimate:



Proposition 4.8. Let $A$ and $B$ be matrices as in Lemma 4.4 with $N \ge n$.
Then

$$\mathbb{E}\|BA\| \le C\log^{3/2}\Big(\frac{e}{p}\Big) \cdot K\sqrt{np + \log(2N)}.$$

Proof. Let $c$ be the absolute constant as in Proposition 4.6; we can clearly
assume that $c < 1/4$. We define

$$M = \sqrt{\frac{1}{cp}\log\frac{e}{p}}.$$

Note that $M \ge 2$ as required in Proposition 4.7.

Fix a vector $x \in B_2^n$. We decompose it according to the magnitude of the
coordinates, as follows:

$$x = y + z, \qquad y := x\,\mathbf{1}_{\{j:\ |x_j| > M/\sqrt{n}\}}, \qquad z := x\,\mathbf{1}_{\{j:\ |x_j| \le M/\sqrt{n}\}}.$$

Clearly, $\|y\|_2 \le \|x\|_2 \le 1$, $\|z\|_2 \le \|x\|_2 \le 1$. By Markov's inequality, we have

$$\|y\|_0 = \big|\{j : |x_j| > M/\sqrt{n}\}\big| \le \frac{n}{M^2} = \frac{cnp}{\log(e/p)}.$$

Then $y \in B_{2,0}$ as in Proposition 4.6. On the other hand, $\|z\|_\infty \le M/\sqrt{n}$ by
definition, so $z \in B_{2,\infty}$ as in Proposition 4.7. Therefore, by Propositions 4.6
and 4.7 we have
E||£M|| = E sup \\BAx\\ 2 <E sup \\BAy\\ 2 + E sup \\BAz\\ 2 

x&B% yeB 2 , zeB 2 ,oo 

< "iKxJnp + log(2n) + C log 3/2 (M) • Kxjnp + log(2iV). 

Our choice of $M$ and the assumption $N \ge n$ complete the proof. □

Finally, a standard symmetrization argument yields the following norm es- 
timate, which we shall use for sparse random matrices. 

Corollary 4.9. Let $p \in (0, 1]$ and let $N \ge n$ be positive integers. Consider an
$N \times n$ random matrix $A$ whose entries $a_{ij}$ are independent random variables
with mean zero and such that

$$\mathbb{E}|a_{ij}|^2 \le p, \qquad |a_{ij}| \le 1 \quad \text{for every } i, j.$$

Let $B$ be an $n \times N$ matrix such that $\|B\| \le 1$. Then

$$\mathbb{E}\|BA\| \le C\log^{3/2}\Big(\frac{e}{p}\Big)\sqrt{np + \log(2N)}.$$

Remark. It would be interesting to remove the logarithmic term from this
estimate.

Proof. Let $g_{ij}$ be independent standard normal random variables. Consider
the random matrix $\tilde A = (g_{ij}a_{ij})$. By (2.1), we have

$$\mathbb{E}\|BA\| \le (2\pi)^{1/2}\,\mathbb{E}\|B\tilde A\|. \tag{4.11}$$

By Lemma 4.3, conditions (4.4) hold with some random parameter $K \ge 1$
which only depends on the random variables $(a_{ij})$ and not on $(g_{ij})$, and which
satisfies

$$\mathbb{E}_a K \le C_1 \tag{4.12}$$

where $C_1$ is an absolute constant. Here and below we write $\mathbb{E}_a$ when the
expectation is with respect to $(a_{ij})$, and $\mathbb{E}_g$ if the expectation is with respect
to $(g_{ij})$.

Condition on the random variables $(a_{ij})$. Proposition 4.8 then yields

$$\mathbb{E}_g\|B\tilde A\| \le C\log^{3/2}\Big(\frac{e}{p}\Big) \cdot K\sqrt{np + \log(2N)}.$$

Therefore, when we remove the conditioning, we obtain by (4.12) that

$$\mathbb{E}\|B\tilde A\| = \mathbb{E}_a\mathbb{E}_g\|B\tilde A\| \le C\log^{3/2}\Big(\frac{e}{p}\Big) \cdot C_1\sqrt{np + \log(2N)}.$$

This and (4.11) complete the proof. □



Proof of Theorem 4.1. By the standard symmetrization technique described in
Section 2, we can assume without loss of generality that all $a_{ij}$ are symmetric
random variables. We decompose the matrix $A$ according to the magnitude of
its entries as follows. Given a subset $I \subseteq \mathbb{R}$, we define the truncated matrix

$$\mathrm{trunc}(A, I) = \big(a_{ij}\mathbf{1}_{\{|a_{ij}| \in I\}}\big).$$

Consider

$$A^{(0)} = \mathrm{trunc}(A, [0, 1]); \qquad A^{(k)} = 2^{-k}\,\mathrm{trunc}\big(A, (2^{k-1}, 2^k]\big), \quad k = 1, 2, \ldots$$

Then we have a decomposition $A = \sum_{k=0}^\infty 2^k A^{(k)}$. This sum is actually finite
because of the boundedness assumption on $a_{ij}$. Indeed, we have

$$A = A^{(0)} + \sum_{k=1}^{k_0} 2^k A^{(k)} \tag{4.13}$$

where $k_0$ is the maximal integer such that

$$2^{k_0 - 1} \le \Big(\frac{Mn}{\log(2N)}\Big)^{\frac{1}{2+\varepsilon}}. \tag{4.14}$$

Because $a_{ij}$ are symmetric random variables, all entries $a_{ij}^{(k)}$ of the matrices
$A^{(k)}$ satisfy $\mathbb{E} a_{ij}^{(k)} = 0$ and $|a_{ij}^{(k)}| \le 1$.
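
The dyadic decomposition of $A$ by the magnitude of its entries can be sketched in code as follows. This is an illustration only: the helper names `trunc` and `dyadic_decomposition`, the interval convention $(2^{k-1}, 2^k]$ for $k \ge 1$, and the heavy-tailed test matrix are assumptions of the example.

```python
import numpy as np

def trunc(A, lo, hi, closed_left=False):
    """Keep entries with |a_ij| in (lo, hi] (or [lo, hi] if closed_left); zero out the rest."""
    absA = np.abs(A)
    mask = (absA >= lo if closed_left else absA > lo) & (absA <= hi)
    return A * mask

def dyadic_decomposition(A):
    """Return [A^(0), A^(1), ..., A^(k0)] such that A = sum_k 2^k A^(k)."""
    pieces = [trunc(A, 0.0, 1.0, closed_left=True)]
    k = 1
    while 2.0 ** (k - 1) < np.abs(A).max():
        pieces.append(2.0 ** (-k) * trunc(A, 2.0 ** (k - 1), 2.0 ** k))
        k += 1
    return pieces

rng = np.random.default_rng(0)
A = rng.standard_t(df=5, size=(6, 4)) * 3.0      # heavy-ish tails, purely illustrative
pieces = dyadic_decomposition(A)
reconstructed = sum(2.0 ** k * piece for k, piece in enumerate(pieces))
print(len(pieces), np.allclose(A, reconstructed))
```

Every piece has entries bounded by 1 in absolute value, and the pieces with larger index $k$ are sparser, which is the property exploited in the rest of the proof.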

Using Corollary 4.9 for the matrix $A^{(0)}$ and $p = 1$, we obtain

$$\mathbb{E}\|BA^{(0)}\| \le C_1\sqrt{n + \log(2N)} \le 2C_1\sqrt{Mn}, \tag{4.15}$$

where the last inequality follows because $\log(2N) \le Mn$ and $M \ge 1$ by the
hypothesis.

Now we fix $1 \le k \le k_0$. Using the $(2 + \varepsilon)$-th moment assumption, we have
by Markov's inequality that

$$\mathbb{P}(a_{ij}^{(k)} \ne 0) \le \mathbb{P}(|a_{ij}| > 2^{k-1}) \le 2^{-(2+\varepsilon)(k-1)} =: p_k.$$

This and the bound $|a_{ij}^{(k)}| \le 1$ yield $\mathbb{E}(a_{ij}^{(k)})^2 \le p_k$. With this, we apply
Corollary 4.9 for the matrix $A^{(k)}$ and obtain

$$\mathbb{E}\|BA^{(k)}\| \le C\log^{3/2}\Big(\frac{e}{p_k}\Big)\sqrt{np_k + \log(2N)}.$$

By the definition of $p_k$ and by (4.14), we have

$$p_k \ge p_{k_0} \ge \frac{\log(2N)}{Mn}.$$

Therefore, $np_k + \log(2N) \le (1 + M)np_k \le 2Mnp_k$, so

$$\mathbb{E}\|BA^{(k)}\| \le C\log^{3/2}\Big(\frac{e}{p_k}\Big)\sqrt{2Mnp_k} \le C_2\big[1 + (2 + \varepsilon)(k - 1)\big]^{3/2}\,2^{-(1+\varepsilon/2)(k-1)} \cdot \sqrt{Mn}. \tag{4.16}$$

Using (4.13) and the triangle inequality, then using (4.15) and (4.16), we
conclude that

$$\begin{aligned}
\mathbb{E}\|BA\| &\le \mathbb{E}\|BA^{(0)}\| + \sum_{k=1}^{k_0} 2^k\,\mathbb{E}\|BA^{(k)}\| \\
&\le 2C_1\sqrt{Mn} + \sum_{k=1}^{k_0} C_2\big[1 + (2 + \varepsilon)(k - 1)\big]^{3/2}\,2^{k-(1+\varepsilon/2)(k-1)} \cdot \sqrt{Mn} \\
&\le C_3\sqrt{Mn} \cdot \sum_{k=1}^{\infty} k^{3/2}\,2^{-\varepsilon k/2} \\
&= C(\varepsilon)\sqrt{Mn}.
\end{aligned}$$

This completes the proof of Theorem 4.1. □

4.6. Almost square matrices. The main application of Theorem 4.1 is for
almost square matrices - those for which $N = n^{1+o(1)}$. The next lemma verifies
the hypotheses of Theorem 4.1 for such matrices.

Lemma 4.10. Let $\varepsilon \in (0, 1)$ and let $N, n$ be positive integers satisfying $N \le
n^{1+\varepsilon/10}$. Let $A$ be an $N \times n$ random matrix whose entries $a_{ij}$ are independent
random variables with $(4 + \varepsilon)$-th moment bounded by 1. Define the random
variable $M$ by the equation

$$\max_{i,j}|a_{ij}| = \Big(\frac{Mn}{\log(2N)}\Big)^{\frac{1}{2+\varepsilon/4}}. \tag{4.17}$$

Then, for every $t \ge 1$, one has

$$\mathbb{P}(M > C(\varepsilon)t) \le \frac{1}{t^2}.$$

In particular, one has $\mathbb{E}M \le C_1(\varepsilon)$.

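Proof. By Markov's inequality, we have for every $i, j$ that

$$\mathbb{P}(|a_{ij}| > s) \le \frac{1}{s^{4+\varepsilon}}, \qquad s > 0.$$

Let $t \ge 1$. We then have

$$\mathbb{P}\big(|a_{ij}| > (t^2 nN)^{\frac{1}{4+\varepsilon}}\big) \le \frac{1}{t^2 nN}.$$

Taking the union bound over all $nN$ random variables $a_{ij}$, we obtain

$$\mathbb{P}\big(\max_{i,j}|a_{ij}| > (t^2 nN)^{\frac{1}{4+\varepsilon}}\big) \le \frac{1}{t^2}. \tag{4.18}$$

The assumption $N \le n^{1+\varepsilon/10}$ yields that

$$nN \le \Big(\frac{C(\varepsilon)n}{\log(2N)}\Big)^{2+\varepsilon/8}.$$

Therefore, since $\frac{2+\varepsilon/8}{4+\varepsilon} \le \frac{1}{2+\varepsilon/4}$ and $t \ge 1$, we have

$$(t^2 nN)^{\frac{1}{4+\varepsilon}} \le \Big(\frac{C(\varepsilon)tn}{\log(2N)}\Big)^{\frac{1}{2+\varepsilon/4}}.$$

Using this in (4.18), we obtain

$$\mathbb{P}(M > C(\varepsilon)t) \le \mathbb{P}\big(\max_{i,j}|a_{ij}| > (t^2 nN)^{\frac{1}{4+\varepsilon}}\big) \le \frac{1}{t^2}.$$

Integration completes the proof. □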

We are now ready to state and prove a partial case of Theorem 1.1 for almost
square matrices.

Corollary 4.11. Let $\varepsilon \in (0,1)$ and let $N, n$ be positive integers satisfying
$N \le n^{1+\varepsilon/10}$. Let $A$ be an $N \times n$ random matrix whose entries are independent
random variables with mean zero and $(4 + \varepsilon)$-th moment bounded by 1. Let $B$
be an $n \times N$ matrix such that $\|B\| \le 1$. Then

$$\mathbb{E}\|BA\| \le C(\varepsilon)\sqrt{n}.$$

Proof. Without loss of generality we may assume that $N \ge n$ by adding an
appropriate number of zero rows to $A$ and zero columns to $B$. Also, using the
standard symmetrization, we can assume that the random variables $a_{ij}$ are
symmetric. Let $M$ be the random variable as in Lemma 4.10, and let $t \ge 1$.
By the definition, $\{M \le t\}$ is a product event. Therefore, conditioning on
this event (i) preserves the independence of the entries of $A$; (ii) makes all
these entries bounded as in (4.17); (iii) can only reduce their moments by
Lemma 2.9, thus for all $i, j$ we have

$$\mathbb{E}\big(|a_{ij}|^{2+\varepsilon/4} \,\big|\, M \le t\big) \le \mathbb{E}|a_{ij}|^{2+\varepsilon/4} \le 1.$$

Therefore, we can apply Corollary 4.2 conditionally, with $\varepsilon/4$ and with $M$
replaced by $\max(M, 10)$, which gives

$$\big[\mathbb{E}\big(\|BA\|^2 \,\big|\, M \le t\big)\big]^{1/2} \le C_0(\varepsilon)\sqrt{tn} \quad \text{for } t \ge 1.$$

Additionally, by Lemma 4.10 we have

$$\mathbb{P}(M > C(\varepsilon)t) \le \frac{1}{t^2} \quad \text{for } t \ge 1.$$

By Lemma 2.10, this yields

$$\mathbb{E}\|BA\| \le (\mathbb{E}\|BA\|^2)^{1/2} \le C_1(\varepsilon)\sqrt{n}$$

as claimed. □

5. Completion of the proof of Theorem 1.1

Proof of Theorem 1.1. By adding an appropriate number of zero rows to $B$ or
zero columns to $A$ we can assume that $m = n$, thus $B$ is an $n \times N$ matrix.
Consider the exponent

K = K{e) = 1 + 1. 

As usual, let $B_1, \ldots, B_N$ be the columns of the matrix $B$. Consider the subset
$I \subseteq \{1, \ldots, N\}$ of large columns defined as

$$I := \{i : \|B_i\|_2 > C_0(\varepsilon)\log^{-\kappa}(2n)\}.$$

Here we choose $C_0(\varepsilon)$ sufficiently large so that, by (2.4) and Markov's
inequality, we have

$$N_0 := |I| \le C_0(\varepsilon)^{-2}\,n\log^{2\kappa}(2n) \le n^{1+\varepsilon/10}.$$

Denote by $A_I$ the $N_0 \times n$ submatrix of $A$ whose rows are in $I$, by $B_I$ the
$n \times N_0$ submatrix of $B$ whose columns are in $I$ (and similarly for $I^c$). The
decomposition $BA = B_I A_I + B_{I^c}A_{I^c}$ implies by the triangle inequality that

$$\|BA\| \le \|B_I A_I\| + \|B_{I^c}A_{I^c}\|. \tag{5.1}$$

This splits our problem into two subproblems, one for $I$ and one for $I^c$. Of
course, if $I$ or $I^c$ is empty then the corresponding matrix is zero and we can
skip its estimation.

The matrices $A_I$, $B_I$ are almost square, so Corollary 4.11 applies for them,
giving

$$\mathbb{E}\|B_I A_I\| \le C(\varepsilon)\sqrt{n}. \tag{5.2}$$

On the other hand, the columns of the matrix $B_{I^c}$ are small by the definition
of $I$:

$$\|B_i\|_2 \le C_0(\varepsilon)\log^{-\kappa}(2n) \quad \text{for every } i \in I^c.$$

Therefore, Theorem 3.9 applies to the matrices $A_{I^c}$, $B_{I^c}$, which gives

$$\mathbb{E}\|B_{I^c}A_{I^c}\| \le C_1(\varepsilon)\sqrt{n}. \tag{5.3}$$

Putting estimates (5.2) and (5.3) into (5.1), we conclude that

$$\mathbb{E}\|BA\| \le C_2(\varepsilon)\sqrt{n}.$$

Theorem 1.1 is proved. □
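
The splitting of the columns of $B$ into large and small ones, and the resulting block decomposition of $BA$, can be illustrated numerically. In the sketch below the value of `threshold` is a stand-in for $C_0(\varepsilon)\log^{-\kappa}(2n)$, and the sizes and the Gaussian test matrices are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 5, 12
B = rng.standard_normal((n, N))
B /= np.linalg.norm(B, 2)                    # ||B|| = 1
A = rng.standard_normal((N, n))

threshold = 0.5                               # stands in for C_0(eps) * log^(-kappa)(2n)
col_norms = np.linalg.norm(B, axis=0)
I = col_norms > threshold                     # boolean mask of the large columns
Ic = ~I

BA_split = B[:, I] @ A[I, :] + B[:, Ic] @ A[Ic, :]
lhs = np.linalg.norm(B @ A, 2)
rhs = np.linalg.norm(B[:, I] @ A[I, :], 2) + np.linalg.norm(B[:, Ic] @ A[Ic, :], 2)
print("decomposition exact:", np.allclose(B @ A, BA_split))
print("triangle inequality (5.1):", lhs, "<=", rhs)
print("|I| =", int(I.sum()), " Markov-type bound:", (col_norms**2).sum() / threshold**2)
```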